Federated Content Search (CLARIN-FCS)

The goal of the CLARIN Federated Content Search (CLARIN-FCS) Core specification is to introduce an interface specification that decouples the search engine functionality from its exploitation (i.e. user interfaces and third-party applications) and allows services to access heterogeneous search engines in a uniform way.

The CLARIN-FCS Interface Specification defines a set of capabilities, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard; additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.
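Because CLARIN-FCS builds on SRU/CQL, a Client query is ultimately an SRU searchRetrieve request that carries a CQL query. The following is a minimal sketch of how such a request URL could be assembled. The parameter names (operation, version, query, startRecord, maximumRecords) come from SRU 1.2; the endpoint address and the function name build_search_url are assumptions made for illustration, and CLARIN-FCS-specific extension parameters are not shown.

```python
from urllib.parse import urlencode


def build_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve URL that carries a CQL query."""
    params = {
        "operation": "searchRetrieve",  # SRU operation name
        "version": "1.2",               # SRU protocol version
        "query": cql_query,             # the CQL query, URL-encoded below
        "startRecord": start_record,    # 1-based index of the first record
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint address, used only for illustration.
url = build_search_url("https://example.org/fcs-endpoint", '"language resource"')
print(url)
```

The sketch only builds the URL; sending the request and handling the response are left out.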

Specifically, the CLARIN-FCS Interface Specification consists of two components: a set of formats and a transport protocol. The Endpoint is a software component that acts as a bridge between a Client, which sends these formats using the transport protocol, and a Search Engine. The Search Engine is a custom software component that allows language resources in a Repository to be searched. The Endpoint implements the transport protocol and mediates between the CLARIN-FCS-specific formats and the idiosyncrasies of the Search Engines of the individual Repositories. The following figure illustrates the overall architecture:

                 +---------+
                 |  Client |
                 +---------+
                     /|\
                      |
          -------------------------
         |        SRU / CQL        |
         | w/CLARIN-FCS extensions |
          -------------------------
                      |
                     \|/
 +-----------------------------------------+
 |        |      Endpoint     /|\          |
 |        |                    |           |
 |  ---------------    ------------------  |
 | | Translate CQL |  | Translate Result | |
 |  ---------------    ------------------  |
 |        |                    |           |
 |       \|/                   |           |
 +-----------------------------------------+
                     /|\
                      |
                     \|/
        +---------------------------+
        |       Search Engine       |
        +---------------------------+

In general, the workflow in CLARIN-FCS is as follows: a Client submits a query to an Endpoint. The Endpoint translates the query from CQL to the query dialect used by the Search Engine and submits the translated query to the Search Engine. The Search Engine processes the query and generates a result set, i.e. it compiles the set of hits that match the search criterion. The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client.
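The workflow can be sketched in a few lines of code. The example below is only an illustration under strong assumptions: ToySearchEngine, ToyEndpoint and Hit are hypothetical names, the "CQL translation" handles nothing more than a single term or quoted phrase, and the Hit record is a stand-in, not the CLARIN-FCS result format.

```python
from dataclasses import dataclass


class ToySearchEngine:
    """Hypothetical Search Engine: case-insensitive substring search."""

    def __init__(self, sentences):
        self.sentences = sentences

    def search(self, needle):
        """Return (index, sentence) pairs for sentences containing needle."""
        needle = needle.lower()
        return [(i, s) for i, s in enumerate(self.sentences) if needle in s.lower()]


@dataclass
class Hit:
    """Illustrative uniform result record (not the CLARIN-FCS format)."""
    resource_id: str
    text: str


class ToyEndpoint:
    """Hypothetical Endpoint that mediates between Client and Search Engine."""

    def __init__(self, engine):
        self.engine = engine

    def translate_query(self, cql):
        """Translate a CQL query (single term or quoted phrase only)."""
        term = cql.strip()
        if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
            term = term[1:-1]  # strip the surrounding quotes
        if not term:
            raise ValueError("empty query")
        return term

    def translate_result(self, raw_hits):
        """Map engine-specific hits to the uniform Hit record."""
        return [Hit(resource_id=f"sentence-{i}", text=s) for i, s in raw_hits]

    def search_retrieve(self, cql):
        engine_query = self.translate_query(cql)      # CQL -> engine dialect
        raw_hits = self.engine.search(engine_query)   # engine builds result set
        return self.translate_result(raw_hits)        # engine format -> uniform


engine = ToySearchEngine(["The cat sat.", "A dog barked.", "Cats purr."])
endpoint = ToyEndpoint(engine)
hits = endpoint.search_retrieve('"cat"')
for hit in hits:
    print(hit.resource_id, hit.text)
```

A real Endpoint would additionally parse full CQL, serialize the hits as an SRU response and report errors as SRU diagnostics; none of that is attempted here.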

Specifications

CLARIN-FCS is defined in two specifications: the Core specification and the supplementary Data View specification. The former defines the general framework, and the latter defines additional Data Views, which allow Endpoints to provide resources in more detailed formats.
The normative versions of the specifications are available from the CLARIN-EU developer Wiki (no access? - look here):

A PDF of the Core Specification is also publicly available.

Implementations

In most cases, the implementing data provider only builds a simple wrapper service that translates between the CLARIN-FCS/SRU protocol and the endpoint's software. However, there are efforts to provide default wrappers (or at least sample implementations) for individual persistence systems such as SQL databases or XML databases.

Non-exhaustive list of reference implementations:

  • The IDS has made a reference implementation of a CLARIN-FCS endpoint:
    https://svn.clarin.eu/FCSSimpleEndpoint/trunk/ (NOTE: the library still implements the deprecated version of the specification.)
    The source is divided into two parts: the SRU implementation (de.mannheim.ids.sru) and the Cosmas-specific implementation (de.mannheim.ids.cosmassru). The latter package also contains the Servlet, which shows how to initialize the endpoint programmatically. To use the library, endpoint developers need to customize "endpoint-config.xml" and implement the SRUDatabase interface; CosmasSRUDatabase can serve as an example. In addition, endpoint developers need to create (or adapt the existing) Servlet and the web.xml and context.xml deployment descriptors.
  • Based on the IDS library, Alex Kislev has implemented a CQP/SRU bridge: any CQP-indexed corpus can be integrated into the CLARIN Federated Content Search quite easily.
  • Recently, OCLC announced the oclcsrw, an Open Source implementation of an SRU 1.2 server that exposes a database interface allowing implementers to expose their databases via SRU 1.2. Database implementations are separately available for Apache Lucene and DSpace.
  • The ICLTT (Vienna) is developing corpus_shell, a modular framework for publishing heterogeneous distributed language resources that builds on FCS. The system currently contains prototype implementations of an FCS wrapper for MySQL databases (in PHP) and for the ddc search engine (in Perl). Additionally, an eXist/XQuery-based solution is being developed; this code has been moved out of corpus_shell and into SADE as a module. These implementations are work in progress and do not yet fully conform to the FCS specifications.

Testing Endpoint conformance

Conformance to CLARIN-FCS is tested with the CLARIN FCS SRU/CQL Conformance Test.

OCLC also offers an SRU Server Tester, but it can only test conformance to the SRU protocol, not to CLARIN-FCS.
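Before running either tester, an endpoint developer may want a quick local sanity check of a response. The sketch below is not part of the conformance tools named above and covers only a small fraction of what they check: it verifies that a document is an SRU searchRetrieveResponse in the SRU 1.1/1.2 namespace and carries the version and numberOfRecords elements. The function name check_search_response and the sample response are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by SRU 1.1/1.2 responses.
SRW_NS = "http://www.loc.gov/zing/srw/"


def check_search_response(xml_text):
    """Return a list of problems found in an SRU searchRetrieve response."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{SRW_NS}}}searchRetrieveResponse":
        problems.append(f"unexpected root element: {root.tag}")
        return problems  # nothing else can be checked sensibly
    if root.findtext(f"{{{SRW_NS}}}version") is None:
        problems.append("missing <version>")
    count = root.findtext(f"{{{SRW_NS}}}numberOfRecords")
    if count is None or not count.strip().isdigit():
        problems.append("missing or non-numeric <numberOfRecords>")
    return problems


# Hand-written sample response, used only for illustration.
SAMPLE = """<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>0</sru:numberOfRecords>
</sru:searchRetrieveResponse>"""

problems = check_search_response(SAMPLE)
print(problems if problems else "no problems found")
```

Passing this check says nothing about CLARIN-FCS conformance; it only catches gross structural errors early.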

Existing Endpoints

The full list of implemented endpoints is available at the Centre Registry.