Synthesis and Machine Learning for Heterogeneous Extraction (PLDI 2019 - PLDI Research Papers)

Who

Arun Iyer, Manohar Jonnalagedda, Suresh Parthasarathy, Arjun Radhakrishna, Sriram Rajamani

Track

PLDI 2019 PLDI Research Papers

When

Mon 24 Jun 2019 15:00 - 15:20 at 229AB - Synthesis
Chair(s): Nuno P. Lopes

Abstract

We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise in several situations, such as extracting attributes from web pages, from machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data.

We use machine learning models ("ML models") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called "field constraints") to certify well-formedness of extracted values. Using synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that such an iterative process improves the quality of the extraction process. This iterative approach, called HDEF, is novel, not only in the way it combines the ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity.

More broadly, our approach points to ways by which machine learning and programming language techniques can be combined to get the best of both worlds: handling noise, transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.
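
The label, synthesize, validate, re-train loop described in the abstract can be illustrated with a toy Python sketch. This is not the HDEF implementation. Everything here is an assumption made for illustration: the flight-number field, the `FLIGHT` field constraint, the one-rule DSL ("return the token after a keyword"), the greedy covering synthesizer, and the `noisy_labeler` stand-in, which replaces a real ML model such as a CRF and mimics re-training by simply memorizing validated labels.

```python
import re

# Hypothetical field constraint: a flight number is two letters plus 2-4 digits.
FLIGHT = re.compile(r"[A-Z]{2}\d{2,4}")

DOCS = [
    "Flight: AA123 departs 10:30",
    "Flight: UA4521 departs 09:15",
    "Your booking 2 seats Flight: DL88 confirmed",
    "Itinerary carrier-code BA2490 gate 5",
    "Itinerary carrier-code LH400 gate 12",
]

def field_constraint(value):
    """Certify that an extracted value is well-formed."""
    return value is not None and FLIGHT.fullmatch(value) is not None

def noisy_labeler(doc, trusted):
    """Stand-in for an ML model: guess the first token containing a digit.
    'Re-training' is mimicked by reusing labels validated in earlier rounds."""
    if doc in trusted:
        return trusted[doc]
    for tok in doc.split():
        if any(c.isdigit() for c in tok):
            return tok
    return None

def run_program(keyword, doc):
    """Toy DSL: a program is 'return the token right after `keyword`'."""
    toks = doc.split()
    for i, tok in enumerate(toks[:-1]):
        if tok == keyword:
            return toks[i + 1]
    return None

def synthesize(docs, labels):
    """Greedily pick programs that reproduce the well-formed labels.
    Ill-formed (noisy) labels are ignored rather than fitted."""
    uncovered = {d for d in docs if field_constraint(labels[d])}
    candidates = sorted({tok for d in docs for tok in d.split()})
    programs = []
    while uncovered:
        gain = lambda kw: sum(run_program(kw, d) == labels[d] for d in uncovered)
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break
        programs.append(best)
        uncovered -= {d for d in uncovered if run_program(best, d) == labels[d]}
    return programs

def hdef_sketch(docs, max_rounds=5):
    """Alternate labeling and synthesis until the validated labels stop changing."""
    trusted, programs = {}, []
    for _ in range(max_rounds):
        labels = {d: noisy_labeler(d, trusted) for d in docs}
        programs = synthesize(docs, labels)
        new_trusted = {}
        for d in docs:
            for kw in programs:
                out = run_program(kw, d)
                if field_constraint(out):
                    new_trusted[d] = out
                    break
        if new_trusted == trusted:
            break
        trusted = new_trusted
    return programs, trusted

programs, extracted = hdef_sketch(DOCS)
```

On this toy input the labeler initially mislabels the third document as `2`. The field constraint rejects that label, the synthesizer learns one program per document format from the remaining labels, and running those programs recovers `DL88`, which is then fed back as a trusted label in the next round.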

Arun Iyer

Microsoft Research, India

Manohar Jonnalagedda

Inpher Inc., Switzerland

Suresh Parthasarathy

Microsoft Research, India

Arjun Radhakrishna

Microsoft

Sriram Rajamani

Microsoft Research
