Tuesday, March 11, 2008

Liveblogging from Dagstuhl: Day 2 (Mar 11)

  • Holger Bast: The CompleteSearch Engine http://search.mpi-inf.mpg.de/. IR vs DB: the IR index is compressible, has high locality of access, and ranks well, but cannot handle even simple selects. DB vs IR: can query nicely, but no locality of access. CompleteSearch: performs prefix search and range search on the IR index. Ability to perform joins using a keyword search interface. Locality of reference is claimed to be the main advantage of using IR-based indexes instead of "traditional" database indexes. The whole architecture seems similar (in terms of benefits) to column-store database systems.
  • Arjen de Vries: Flexible and Efficient IR using Array Databases. There are many standalone retrieval prototypes, with no clean separation of the different aspects of the experiments, and things are typically monolithic and tied to specific datasets. The goal is to have both flexibility and efficiency. The idea is to specify the type of documents as a set of matrices (make sure to compress them). Then define a set of metrics using the matrix data. Then combine the metrics and matrices into database queries, and have an engine that runs experiments efficiently in a data-independent manner. (So that we do not have to reinvent the wheel every time we want to do something new.)
  • Yosi Mass: Adaptive XML Retrieval System. Given a query in free text, retrieve XML components that satisfy the query. First approach: retrieve documents and then score the fragments within. Second approach: index only XML leaves (need to perform aggregations for retrieving more complex elements). Third approach: index every possible subtree (items overlap, which is an issue when computing frequencies). Solution: split elements into multiple indexes, making sure that we have complete coverage and no overlap of elements within the same index. (Comments indicate that this is a good idea when the number of tags is small: group all similar tags into the same index, instead of mixing apples and oranges, or "chapter" and "section" tags. This becomes a problem with Wikipedia, where we start having too many tags and it is not possible to generate that many indexes --- what about grouping tags together to populate)
  • Djoerd Hiemstra: Sound Ranking Algorithms for XML Search. Pathfinder: XQuery->Relational compiler. Tijah: XML search system for NEXI (Narrowed Extended XPath). NEXI is being used as a sublanguage of XQuery. Need to devise metrics that will allow consistent rankings.
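
A rough illustration of the prefix search mentioned in the CompleteSearch notes above. This is only a sketch under my own assumptions, not the data structure CompleteSearch actually uses: the toy index, the document ids, and the `prefix_search` helper are all made up. The point is just that with a sorted vocabulary, every term sharing a prefix sits in one contiguous range, which is where the locality of access comes from.

```python
from bisect import bisect_left

# Toy inverted index: term -> sorted list of document ids (made-up data).
INDEX = {
    "data": [1, 2],
    "database": [2, 3],
    "dagstuhl": [4],
    "search": [1, 4],
}

# Sorted vocabulary: all terms with a common prefix are adjacent.
VOCAB = sorted(INDEX)


def prefix_search(prefix):
    """Return the sorted ids of documents containing any term that starts with prefix."""
    start = bisect_left(VOCAB, prefix)  # first term >= prefix
    docs = set()
    for term in VOCAB[start:]:
        if not term.startswith(prefix):
            break  # left the contiguous range of matching terms
        docs.update(INDEX[term])
    return sorted(docs)


if __name__ == "__main__":
    print(prefix_search("dat"))
```

A range search over sortable values (dates, numbers encoded as fixed-width strings) could reuse the same idea with two binary searches instead of one.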
Break
  • Amelie Marian: Filesystem search. Keyword search for ranking, and filters on metadata. A pure IR model is not sufficient, due to the need for "fuzzy predicates" (e.g., "get me my proposals from around March 2006"). Needs to accommodate approximate predicates naturally, going beyond "binary" in the features. Proposed a multidimensional approach, scoring each "field" independently and aggregating the scores afterwards. Contributions in query processing: multiple indexes and DAG-based approaches. Used relaxation hierarchies to allow relaxation of predicates (day to month to year...).
  • Kostas Stefanidis: Get best results based on contextual user preferences. Give the best contextual results by inferring the context from the query itself. Implementation: use of a profile tree. Relaxation using hierarchies.
  • Irini Fundulaki: Personalized XML (Pimento). XML queries are both on structure and content. Therefore customize query content with this in mind, and customize the results appropriately. Add scoping rules in the user profiles (for what the query should contain -- with some relaxation) and ordering rules for how the results should preferably be ordered. Described how to perform the relaxation efficiently and effectively. Deriving rules from narratives.
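
To make the relaxation-hierarchy idea from the filesystem search notes above a bit more concrete, here is a minimal sketch. Everything in it is an assumption of mine rather than something shown in the talk: the per-level scores, the keyword scoring, the plain average used to aggregate fields, and the names `date_score`, `keyword_score`, and `file_score`. It only shows the shape of the approach: score each field independently, let a date predicate relax from day to month to year, then combine.

```python
from datetime import date

# Hypothetical scores for each relaxation level: exact day > same month > same year.
LEVEL_SCORES = {"day": 1.0, "month": 0.6, "year": 0.3}


def date_score(query, actual):
    """Score a date predicate by the tightest hierarchy level at which it matches."""
    if query == actual:
        return LEVEL_SCORES["day"]
    if (query.year, query.month) == (actual.year, actual.month):
        return LEVEL_SCORES["month"]
    if query.year == actual.year:
        return LEVEL_SCORES["year"]
    return 0.0


def keyword_score(keywords, text):
    """Fraction of query keywords that occur as words in the text."""
    if not keywords:
        return 0.0
    words = set(text.lower().split())
    return sum(k.lower() in words for k in keywords) / len(keywords)


def file_score(keywords, query_date, file):
    """Score each field independently, then aggregate (plain average, an assumption)."""
    content = keyword_score(keywords, file["text"])
    when = date_score(query_date, file["modified"])
    return (content + when) / 2


if __name__ == "__main__":
    files = [
        {"text": "grant proposal draft", "modified": date(2006, 3, 20)},
        {"text": "grant proposal final", "modified": date(2007, 1, 5)},
    ]
    for f in files:
        print(f["text"], file_score(["proposal"], date(2006, 3, 1), f))
```

A file from the right month but the wrong day still gets partial credit here, instead of being filtered out by a binary predicate.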
Break
  • Maarten Marx: Talked about the use of named entity recognizers to create a graph that can help in various tasks, and that makes it possible to generate concise summaries of a topic.
  • Benny Kimelfeld: Keyword proximity on XML graphs
  • Emiran Curtmola: XML Distributed Retrieval. When XML documents are distributed and we want to run queries over them, one option is a centralized model, where a central server gets the queries and asks for all documents to be aggregated in a single location. Emiran instead describes a distributed system using an overlay network.