DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

Open Access

  • false

Peer Reviewed

  • false

Abstract

  • Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.

Publication date

  • January 1, 2013