Monday, March 10, 2008

Liveblogging from Dagstuhl: Day 1 (Mar 10)

For some strange reason I was invited to Dagstuhl for a seminar on "Ranked XML Querying". Why strange? Because I have done nothing on ranked XML querying, or on XML at all. I have done some work in ranking (the Economining project is all about economically-induced ranking at its very core), but why worry? You cannot say no to an invitation!

So, after a very, very bumpy flight from New York to Paris, a train trip from Paris to Saarbrucken, and a cab ride from Saarbrucken to Dagstuhl, here I am. Dagstuhl is a very interesting place: everything is based on an honor system (you keep track yourself of the snacks, beers, wines, etc. that you consume), and the rooms do not even have keys.

We started today with a set of tutorials and short 5-minute introductions to the research that each one of us is doing. There were 3 sets: the "database" people, the "IR" people, and the "web" people. I was classified in the "web" track. :-)

The "database" people.
  • Ihab Ilyas: RankSQL, uncertain databases. Especially interesting was a bullet on "probabilistic data cleaning" that uses model lineage.
  • David Toman: Description logics and query combination. Description logic for physical design. How to create a query optimizer using fine-grained information about physical design.
  • Harald Schöning: Talked about Tamino, "the first native XML database", and its applications in police information systems, airport logistics, financial derivatives, fleet management, and newsroom systems.
  • Peter Apers: Described early 1990s efforts to bring together the database and IR communities. In retrospect, he saw it as a problem that we, as a database community, forgot about the hierarchical and network models, and now we need to talk to the IR and web communities, which really use such models for information representation.
  • Benny Kimelfeld: Keyword search over structured databases, how to do flexible and inexact queries over structured databases, and how to query probabilistic data.
  • Emiran Curtmola: Optimization of XML queries and XQFT (XML full text) queries. How to rank and evaluate the quality of search results; how to summarize such results.
  • Stefan Klinger: Graph theory and XML schema validation. Started working on the Pathfinder compiler that converts XML to relational expressions; extends the Pathfinder compiler to full-text.
  • Kostas Stefanidis: Personalized systems with application in personalized search, how to manage context-dependent preferences, database selection based on contextual preferences.
  • Irini Foundoulaki: Personalized XML full-text search and experiments with INEX data. XML access control and how to formalize the semantics and apply them; security for provenance data.
  • Amelie Marian: Data corroboration: given a large amount of low-quality data, corroboration can improve the quality; understanding user reviewing patterns (structure and query reviews); multi-dimensional search for file systems.
  • Gerhard Weikum: How to turn the web into a semantic database: harvest and combine (a) hand-crafted data, (b) automatic knowledge extraction, (c) social networks and human computing. The Yago system, NAGA queries. Plus: p2p search, personalized search, social search, time-travel search on web archives.
  • Ralf Schenkel: TopX, bridging the DB and IR gap. XML query languages for real users.


Ranked XML Querying: The DB Tutorial (Weikum)
Started with a quadrant (structured vs. unstructured, for both search and data):

  • Both structured: Databases
  • Both unstructured: IR
  • Structured search, unstructured data: information extraction and text mining workflows.
  • Unstructured search, structured data: keyword search over relational and XML data.
  • Motivation 1: Text matching. Add keyword search for searching relational and XML data. We need (principled) ranking approaches for result ranking. We also need probabilistic integration of different relations. Question: what defines a ranking function as "principled"? Answer: tf.idf is not "principled" but ad hoc, even though it performs well; language models, BM25, and so on are built on theoretical models and can be reused in different contexts. XML and searching: XPath and similar languages add multiple predicates. Typically we cannot satisfy them all (plus, it is difficult to write them in a semantically correct manner). Therefore we need relaxation.
  • Motivation 2: Too-many-answers. We resort to "top-K" or skyline (Pareto optimal). Probabilistic ranking for SQL and how to adopt the likelihood model for SQL ranking. How to fit together deterministic predicates with "soft" predicates.
  • Motivation 3: Schema relaxation. We can relax not only content queries but schema as well.
  • Motivation 4: Information Extraction and Entity Search. We can extract our data and build (uncertain) tables from the data. How can we extract, query, and rank such results in an efficient and effective manner? If we take a graph-based approach, with multiple link types, how can we effectively exploit the generated network? We can rank by confidence, by informativeness, or even by compactness (Steiner tree).
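
To make the "too-many-answers" motivation a bit more concrete, here is a minimal sketch of the two options mentioned above: a skyline (the Pareto-optimal answers) and a top-K under a weighted score. This is my own illustration, not something shown in the tutorial; the function names, the two-attribute candidates, and the weights are all made up, and I assume that higher values are better on every attribute.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on every attribute and strictly
    better on at least one (assumption: higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k(points: Sequence[Point], weights: Sequence[float], k: int) -> List[Point]:
    """Top-K under a weighted-sum score, best first."""
    def score(p: Point) -> float:
        return sum(w * x for w, x in zip(weights, p))
    return heapq.nlargest(k, points, key=score)


# Hypothetical candidates, scored on two made-up attributes.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9), (0.1, 0.1)]
print(skyline(candidates))
print(top_k(candidates, weights=(0.7, 0.3), k=2))
```

The difference in spirit: top-K needs someone to commit to a scoring function (here, the weights), while the skyline needs no weights but can still return many answers when the attributes trade off against each other.
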
Lunch: We were sitting at prearranged tables, with our names on randomly assigned seats, to encourage/force interaction.

The "IR" people.

We continued with the introduction of the IR people.
  • Holger Bast: All data is text (In the beginning was the word...). All text is semi-structured. Make fancy searches fast and easy to use. Demo of CompleteSearch of DBLP (impressive!) and of FacetedDBLP.
  • Maarten Marx: NEXI query language, doing XML retrieval IR-first.
  • Martin Theobald: Probabilistic databases (uncertainty and lineage, Trio project). Efficient XML-IR. TopX system, plus call for INEX.
  • Djoerd Hiemstra: IR language models, multimedia, and XML & entity search. PathFinder/Tijah.
  • Yosi Mass: XML Query and XML fragments. Vector space model for XML ranking and relevance feedback for XML. Desktop search and UIMA annotations.
  • Arjen de Vries: Improve search system engineering efficiency. Given a declarative specification of the collection, background, context, and of a retrieval model, generate a "Parameterized Search System" (PSS).
  • Thomas Rölleke: Seamless DB+IR, HySpirit retrieval engine.
  • Ingo Frommholz: Annotations and meta-annotations (annotations on annotations). Searching documents with annotations, or doing discussion search (finding documents that get positive comments(?)).
IR Tutorial by Djoerd Hiemstra: History of IR developments: STAIRS, introduction of GML (separation of content from formatting), Codd's relational model,... Discussion of INEX plus some experimental results. Discussion of LM, BM25, etc.

The "Web" people.
  • Sihem Amer-Yahia: Her story from monolithic to atheist to agnostic, all in terms of data management. Serving socially relevant content to users (e.g., what I enjoy watching, depending on the company). She plagiarized the "show me the money" slogan!
  • Pierre Senellart: Research on the hidden web. Discovery of web services, probing and wrapper induction.
  • Sebastian Michel: P2P web search, distributed indexing, social search.
  • Debora Donato: NLP applied to IR, Usage and Link Analysis. Mining social networks, web spam, reputation management.
  • Panos Ipeirotis: Yours truly. SQoUT, EconoMining, Noisy multilabeling, faceted interfaces etc.
Tutorial from Sihem: Making DB&IR socially meaningful. Talked about recommendations: why, when, dealing with long tails, time-awareness, diversity-awareness.