Synthesis and Machine Learning for Heterogeneous Extraction
We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise when extracting attributes from web pages, from machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data.
We use machine learning models ("ML models") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise they produce is hard to manage or debug. We use the (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called "field constraints") to certify well-formedness of extracted values. Using the synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that this iteration improves the quality of the extraction. This iterative approach, called HDEF, is novel not only in the way it combines ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity.
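The loop described above can be illustrated with a small toy sketch. This is not the HDEF implementation: the documents, the candidate regular expressions, the greedy cover in `synthesize`, and every function name are hypothetical, and the "re-train" step is replaced by simply overwriting labels with program outputs that pass the field constraint. The hard-coded `NOISY_LABELS` stand in for the output of an ML model such as a conditional random field.

```python
import re

# Toy heterogeneous inputs in two formats (hypothetical data).
DOCS = [
    "Depart: 10:30 Arrive: 12:45",
    "Depart: 08:15 Arrive: 09:50",
    "Depart: 21:05 Arrive: 23:40",
    "Departure time 07:00, arrival time 09:10",
    "Departure time 16:20, arrival time 18:00",
]

# Stand-in for ML output: (departure time, confidence). Two labels are wrong.
NOISY_LABELS = [("10:30", 0.9), ("09:50", 0.4), ("21:05", 0.8),
                ("07:00", 0.7), ("18:00", 0.3)]

# Hypothetical program space: each "program" is a regex with one capture group.
TIME = r"(\d\d:\d\d)"
CANDIDATES = ["Depart: " + TIME, "Arrive: " + TIME,
              "Departure time " + TIME, "arrival time " + TIME]


def field_constraint(value):
    """Assumed field constraint: the value must be a well-formed HH:MM time."""
    return value is not None and re.fullmatch(
        r"([01]\d|2[0-3]):[0-5]\d", value) is not None


def run(program, doc):
    """Apply one regex program to a document; None if it does not match."""
    match = re.search(program, doc)
    return match.group(1) if match else None


def synthesize(docs, labels):
    """Greedily pick programs by confidence-weighted agreement with the labels
    until every document is covered by some program with a valid output."""
    programs, uncovered = [], set(range(len(docs)))
    while uncovered:
        def agreement(program):
            return sum(labels[i][1] for i in uncovered
                       if run(program, docs[i]) == labels[i][0])
        best = max(CANDIDATES, key=agreement)
        if agreement(best) == 0:
            break  # no remaining label supports any candidate
        programs.append(best)
        uncovered -= {i for i in uncovered
                      if field_constraint(run(best, docs[i]))}
    return programs


def relabel(docs, programs, labels):
    """Stand-in for re-training: trust the first valid program output."""
    updated = []
    for doc, old in zip(docs, labels):
        valid = [v for v in (run(p, doc) for p in programs)
                 if field_constraint(v)]
        updated.append((valid[0], 1.0) if valid else old)
    return updated


def iterate(docs, labels, max_rounds=5):
    """Alternate synthesis and relabeling until the labels stop changing."""
    programs = []
    for _ in range(max_rounds):
        programs = synthesize(docs, labels)
        updated = relabel(docs, programs, labels)
        if updated == labels:
            break
        labels = updated
    return programs, labels


if __name__ == "__main__":
    programs, labels = iterate(DOCS, NOISY_LABELS)
    print(programs)
    print([value for value, _ in labels])
```

On this toy input the two low-confidence wrong labels are outvoted: the programs chosen for each format agree with the high-confidence labels, and applying them to all documents of that format replaces the wrong values. A real system would re-train the model on the improved labels rather than overwrite them directly.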
More broadly, our approach points to ways in which machine learning and programming language techniques can be combined to get the best of both worlds: handling noise and transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.
Mon 24 Jun
Displayed time zone: Tijuana, Baja California
14:00 - 15:30
14:00 | 20m | Talk | Resource-Guided Program Synthesis | PLDI Research Papers | Tristan Knoth (University of California at San Diego, USA), Di Wang (Carnegie Mellon University), Nadia Polikarpova (University of California, San Diego), Jan Hoffmann (Carnegie Mellon University)
14:20 | 20m | Talk | Using Active Learning to Synthesize Models of Applications That Access Databases | PLDI Research Papers | Jiasi Shen (Massachusetts Institute of Technology), Martin C. Rinard (Massachusetts Institute of Technology)
14:40 | 20m | Talk | Synthesizing Database Programs for Schema Refactoring | PLDI Research Papers | Yuepeng Wang (University of Texas at Austin), James Dong (University of Texas at Austin, USA), Rushi Shah (UT Austin), Işıl Dillig (UT Austin)
15:00 | 20m | Talk | Synthesis and Machine Learning for Heterogeneous Extraction | PLDI Research Papers | Arun Iyer (Microsoft Research, India), Manohar Jonnalagedda (Inpher Inc., Switzerland), Suresh Parthasarathy (Microsoft Research, India), Arjun Radhakrishna (Microsoft), Sriram Rajamani (Microsoft Research)