To enable researchers to search for specific patterns across collections of data, CLARIN offers a search engine that connects to the local data collections that are available in the centres. The data itself stays at the centre where it is hosted – which is why the underlying technique is called federated content search.
The search engine summarises and displays what is available (see Tutorial on how to use the Content Search); no login is required. An easy next step is to go to the centre's specialised search interface to perform a more sophisticated query.
The technology behind this federated content search is SRU/CQL (Search/Retrieve via URL, with queries written in the Contextual Query Language), together with a CLARIN-specific extension to this protocol.
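To give a rough idea of what an SRU request looks like, the following minimal sketch builds a searchRetrieve URL carrying a CQL query. The endpoint address is a hypothetical placeholder, the protocol version 1.2 is an assumption, and the CLARIN-specific extension is not shown; only the standard SRU parameters operation, version, query and maximumRecords are used.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_search_url(endpoint: str, cql_query: str, max_records: int = 10) -> str:
    """Build an SRU searchRetrieve URL for a CQL query.

    The endpoint is supplied by the caller; the version value is an
    assumption made for this sketch, not a statement about any centre.
    """
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.2",                # assumed protocol version
        "query": cql_query,              # the CQL query string
        "maximumRecords": str(max_records),
    }
    # urlencode percent-encodes the query so it is safe inside a URL
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, used only for illustration
url = build_sru_search_url("https://centre.example.org/sru", '"language resource"', 5)
print(url)

# Decoding the URL again recovers the original CQL query
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Building the URL separately from sending it keeps the sketch free of network access; an actual client would send this request to each centre's endpoint and parse the XML response.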
Federated Content Search vs. Metadata Search
The federated content search approach differs from metadata search as performed, for example, in the Virtual Language Observatory, where all metadata is first harvested (copied to a single server) and then indexed centrally. Content search is federated rather than centralised for several reasons:
- Legal issues make it impossible for some resources to be copied to another location.
- The size of many datasets makes decentralised indexing the most viable option.
- Most language resources are annotated in a collection-specific manner, which makes it hard to use or develop one single search engine that can cope with all of them.
Although more scalable, federated content search is less powerful than a local search, and certain features, such as ranking, are absent.
Federated content search is therefore particularly useful as a first step to discover where interesting language resources are hosted and at which centre(s) a more specialised search could be useful.
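The discovery role described above can be illustrated with a small sketch. It is not CLARIN's implementation: the centres are hypothetical in-memory stand-ins rather than real SRU endpoints, and the names make_centre and federated_search are invented for this example. The same query is sent to every centre and the hits are grouped by centre, with no attempt at a global ranking.

```python
from typing import Callable, Dict, List

# A hypothetical stand-in for a centre's endpoint: takes a search term
# and returns the matching text snippets held at that centre.
Endpoint = Callable[[str], List[str]]


def make_centre(documents: List[str]) -> Endpoint:
    """Create a toy endpoint that does case-insensitive substring search."""
    def search(term: str) -> List[str]:
        return [d for d in documents if term.lower() in d.lower()]
    return search


def federated_search(term: str, centres: Dict[str, Endpoint]) -> Dict[str, List[str]]:
    """Send the same query to every centre and group hits by centre.

    The data never leaves the centre functions; only the hits are
    collected, and they are not ranked against each other.
    """
    results: Dict[str, List[str]] = {}
    for name, search in centres.items():
        hits = search(term)
        if hits:  # keep only centres that actually returned something
            results[name] = hits
    return results


# Hypothetical centres and data, for illustration only
centres = {
    "centre-a": make_centre(["The cat sat.", "A dog barked."]),
    "centre-b": make_centre(["Cats and dogs.", "Birds sing."]),
    "centre-c": make_centre(["Nothing relevant here."]),
}

hits = federated_search("cat", centres)
for name, snippets in hits.items():
    print(name, len(snippets))
```

The result tells the user which centres hold matching material, which is the point at which a centre's own specialised search interface becomes the better tool.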
Technical Details
For more information on the technical details, see the For Infrastructure Developers section.