Thursday, March 13, 2008

Liveblogging from Dagstuhl: Day 4 (March 13)

  • Stefan Klinger: PathfinderFT (full text) and how to propagate scores in the engine. Based on the XML2relational XQuery Pathfinder engine.
  • Martin Theobald: TopX 2.0. Object store for top-k query processing. Supports BM25 for full text, and employs many IR optimization techniques for speeding up query execution. The 2.0 version implements inverted indexes for XML and various optimizations.
  • Ralf Schenkel: Extended the discussion about TopX
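Since BM25 came up as the full-text scoring function, here is a minimal sketch of the textbook Okapi BM25 formula. This is not TopX's implementation: the function name `bm25_scores`, the whitespace tokenization, the toy documents, the `k1`/`b` defaults, and the "+1 inside the log" IDF variant are all my assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook BM25.

    Hypothetical helper: k1 and b are commonly used defaults, and the IDF
    uses the non-negative "+1 inside the log" variant.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy collection, made up for this example
docs = [
    "xml retrieval with ranked xml elements",
    "relational query processing",
    "top-k query processing for xml",
]
scores = bm25_scores("xml retrieval", docs)
```

A real engine would read term statistics from an inverted index instead of rescanning every document per query.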
Break
  • Mariano Consens: Why retrieval effectiveness measures matter. In DB we measure efficiency, scalability, simplicity, elegance, but rarely effectiveness (yours truly begs to differ, but I was not even classified in DB to start with :-). In INEX you need to retrieve a ranked list of *non-overlapping* elements, so it makes sense to eliminate overlaps from the results. Since we assign a "monotonic" measure of relevance to the atomic elements, the parent (container) elements will have a relevance that depends on the relevance of their leaf items.
  • Harald Schöning: Discussion on implementing full text search in Tamino and other interesting topics. Need to check in more detail.
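To make the non-overlap requirement from Mariano's talk concrete, here is a minimal sketch of removing overlapping elements from a ranked list. The path-style element identifiers, the function names, the scores, and the greedy "keep the highest-scoring element" policy are my assumptions for illustration, not something presented at the seminar.

```python
def is_ancestor(a, b):
    """True if element path a is a proper ancestor of element path b."""
    return b.startswith(a + "/")


def remove_overlaps(scored):
    """Greedy overlap removal (hypothetical policy).

    Walk the results by decreasing score and keep an element only if it
    neither contains nor is contained in an element that was already kept.
    """
    kept = []
    for path, score in sorted(scored, key=lambda item: -item[1]):
        overlaps = any(
            path == k or is_ancestor(path, k) or is_ancestor(k, path)
            for k, _ in kept
        )
        if not overlaps:
            kept.append((path, score))
    return kept


# Made-up retrieval results: (element path, relevance score)
results = [
    ("/article[1]/sec[1]", 0.9),
    ("/article[1]", 0.7),
    ("/article[1]/sec[1]/p[2]", 0.6),
    ("/article[2]/sec[3]", 0.5),
]
ranked = remove_overlaps(results)
```

Here the section with score 0.9 wins, so both its containing article and its child paragraph are dropped, while the section from the second article survives.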
Break
  • Pierre Senellart: Using CRFs to automatically generate wrappers for hidden web databases. Uses a tree-based probabilistic model to capture dependencies between annotations, assuming conditional independence. An iterative approach enriches the description of the wrappers: it identifies the types of important entities, learns how they are connected, and constructs a wrapper.
  • Ihab Ilyas: Uncertainty-aware top-K. Generate possible worlds for the instantiation of each relation, and compute the probability of each world. For example, in an information extraction scenario, we can define a probability of existence for each tuple, and define the possible "world instantiations" of the relation, together with some "world probability". Now, when we want to generate an answer, we can either take a Maximum Likelihood approach and return the most probable world, or take a Bayesian approach and integrate across worlds. One issue is how to do the integration efficiently; the presented research describes a few algorithms for different scenarios.
  • Thomas Roelleke: Described the different retrieval layers as a set of abstractions. How to build a probabilistic system. Nice overview of the literature on probabilistic approaches in the DB and IR communities, plus an overview of approaches that try to connect the two. Discussion on how to implement all the different retrieval models of IR (log-likelihood, vector space, language models, etc.) in SQL.
  • Ingo Frommholz: The POLAR framework. How to use annotations in a principled, probabilistic manner.
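The possible-worlds idea from Ihab's talk can be sketched by brute force. This is a toy illustration under my own assumptions, not one of the presented algorithms: I assume tuples exist independently of each other, and the tuple names, scores, and probabilities are made up.

```python
from itertools import product

# Hypothetical extracted tuples: (name, score, probability of existence)
tuples = [("a", 10, 0.4), ("b", 8, 0.9), ("c", 5, 0.7)]


def possible_worlds(tuples):
    """Yield (world, probability), assuming tuples exist independently."""
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for (name, score, p), present in zip(tuples, mask):
            prob *= p if present else (1.0 - p)
            if present:
                world.append((name, score))
        yield world, prob


worlds = list(possible_worlds(tuples))

# Maximum Likelihood: return the single most probable world
ml_world, ml_prob = max(worlds, key=lambda wp: wp[1])

# Integrating across worlds: probability that each tuple is the top-1 answer
top1 = {}
for world, prob in worlds:
    if world:  # the empty world has no top-1 answer
        best = max(world, key=lambda t: t[1])[0]
        top1[best] = top1.get(best, 0.0) + prob
```

Enumerating worlds is exponential in the number of tuples, which is why doing the integration efficiently is the hard part.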
Break
  • Yours truly: How to structure and rank opinions using econometrics. Essentially, instead of relying on semantics, just associate opinion phrases with some measurable economic variable and discover correlations. Most of the time you need the correct econometric model (aka correct statistical techniques) to get proper results.
  • Ranking Wikipedia using the structural (graph) connections. Personalized PageRank applied to Wikipedia retrieval.
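For reference, a minimal sketch of Personalized PageRank by power iteration. This is the generic algorithm, not the speaker's system: the toy link graph, the seed set, the damping factor `alpha=0.85`, and the choice to send dangling-node mass back to the seeds are all my assumptions.

```python
def personalized_pagerank(graph, seeds, alpha=0.85, iters=100):
    """Power iteration where random jumps land only on the seed nodes.

    graph: dict mapping each node to a list of out-neighbours; every
    neighbour must also appear as a key.
    """
    nodes = list(graph)
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(jump)
    for _ in range(iters):
        # Teleport step: restart at a seed with probability 1 - alpha
        new = {n: (1.0 - alpha) * jump[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # Dangling node: send its mass back to the seeds
                for m in nodes:
                    new[m] += alpha * rank[n] * jump[m]
        rank = new
    return rank


# Toy link graph with made-up article names
links = {
    "XML": ["XQuery", "XPath"],
    "XQuery": ["XML", "XPath"],
    "XPath": ["XML"],
    "Econometrics": ["Statistics"],
    "Statistics": ["Econometrics"],
}
scores = personalized_pagerank(links, seeds={"XQuery"})
```

With the seed set to the query-related article, pages that are unreachable from it (here the econometrics cluster) get a score of zero, which is the personalization effect.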