Frequently Asked Questions - CLARIN's technical infrastructure

Technical Centres

What are CLARIN technical centres?

There are three types of technical centre: B, C and E.

  • A B centre is probably what most CLARIN centres will try to become. It is integrated into the infrastructure with all necessary building blocks: a stable repository with metadata, PIDs, access for protected resources, a classification of its licences based on the PUB/ACA/RES system, and descriptions for web services (next to the web service core model CMDI files).
  • A C centre is the bare minimum: a centre with web-accessible resources that provides CMDI (or comparable) metadata about them.
  • An E centre offers a service to other CLARIN centres but is not part of a national CLARIN consortium.

Whom can I contact?

The answer to this question depends on which centre you want to get in touch with. All centres are part of a national consortium. However, some services are provided at a European level. For detailed questions about the centres, see www.clarin.eu/centres

Who can act as a centre?

Can everyone act as a CLARIN centre in the emerging network? No, because centres need to fulfil a number of criteria set out in the Requirements Specification Documents. In particular, a centre must make a commitment statement that it will provide its services for a defined period of time at a certain service level, which depends on the type of service. Setting up centres that adhere to these requirements costs money, so centres need a clear funding basis.

Why do we need centres?

Researchers will only use resources and tools/services offered via the web when they are sure that they can keep accessing them over a longer period of time. Currently, researchers mostly download resources to their own computer first to guarantee access, but in the cyberinfrastructure scenario, with many large resources and collections, this approach is no longer suitable. Availability and accessibility therefore have to be guaranteed by institutions with a clear service-oriented attitude that impose as few restrictions as possible on usage by researchers. Only this new type of centre can give such guarantees.

What is a centre?

CLARIN Concept Registry

How do I add a new concept for use with CMDI?
What is the granularity of the definitions?

There is much debate about this and other questions, and there is no good universal answer yet. However, we need to start using the concept registry to find out how the definitions can be improved, which concepts are missing and which granularity should be chosen for metadata, morphology and semantic annotation, to mention just a few examples.

Does it solve all semantic interoperability problems?

No - it is just a start at offering a reference, so that users creating new resources can use the registered concepts and schemas describing legacy data can refer to them. But we will find that not all tag sets in use for various purposes can easily be mapped onto one another. Much will also depend on the intended usage. For searching, an imperfect mapping may result in less precision, but for a researcher this may not be a problem.

Does it contain relations?

Not yet, as in most cases they depend on theories and practical intentions. We are looking into supporting multiple sets of relations for different needs. CLARIN intends to offer at least one set of relations with large coverage, which users may want to use or adapt.

What is the CLARIN Concept Registry?

The CLARIN Concept Registry (CCR) is an OpenSKOS instance, which implements the W3C SKOS recommendation and data model. It can be accessed via https://www.clarin.eu/ccr/

Currently it is filled with many concepts from, for example, the EAGLES project and various metadata initiatives, and hopefully it will also cover other sub-disciplines and initiatives. The national CLARIN Concept Registry Coordinators take care that the registry does not become too fragmented and that it meets a number of criteria.

What is a concept registry?

The CLARIN Concept Registry is a step in the direction of interoperability at the level of linguistic encoding (tag sets, metadata elements, etc.). The basic idea is to register all widely used concepts/terminology in CLARIN so that everyone can refer to them. The CLARIN Concept Registry is based on the W3C SKOS recommendation, which is a generic model not restricted to linguistics.

Persistent Identifiers (PIDs)

How can I get an EPIC account as a CLARIN centre?
How do I resolve a handle persistent identifier?

A handle consists of two parts:

  • a prefix (e.g. 1839)
  • a suffix (e.g. 00-0000-0000-0009-3C7E-F)

The official way of referring to a handle is:

hdl: + prefix + / + suffix

e.g.:

hdl:1839/00-0000-0000-0009-3C7E-F

To resolve such a handle (=make it a clickable link that redirects to the resource itself) use the following formula:

http://hdl.handle.net/prefix/suffix

e.g.: http://hdl.handle.net/1839/00-0000-0000-0009-3C7E-F
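The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and nothing here contacts the resolver.

```python
from typing import Tuple

# Resolver base URL as given in the formula above.
RESOLVER = "http://hdl.handle.net/"


def handle_reference(prefix: str, suffix: str) -> str:
    """Build the official reference form: hdl: + prefix + / + suffix."""
    return f"hdl:{prefix}/{suffix}"


def handle_url(prefix: str, suffix: str) -> str:
    """Build the clickable resolver link for a handle."""
    return f"{RESOLVER}{prefix}/{suffix}"


def split_handle(reference: str) -> Tuple[str, str]:
    """Split 'hdl:prefix/suffix' into (prefix, suffix) at the first slash."""
    if not reference.startswith("hdl:"):
        raise ValueError(f"not a handle reference: {reference!r}")
    prefix, sep, suffix = reference[len("hdl:"):].partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"handle needs both prefix and suffix: {reference!r}")
    return prefix, suffix


if __name__ == "__main__":
    prefix, suffix = split_handle("hdl:1839/00-0000-0000-0009-3C7E-F")
    print(handle_url(prefix, suffix))
```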

How do I issue a part identifier for an EPIC handle?

The rewriting behaviour of part identifiers can be configured per handle prefix (it can also be done per individual handle, but this is not supported at this point). For EPIC (version 1, so with prefix 11858) the choice was made to rewrite @[suffix] to ?[suffix]

 So suppose that 11858/1234 resolves to http://clarin.eu then

 11858/1234@test=a will be resolved to http://clarin.eu?test=a
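The rewriting in this example can be mimicked with a small Python function. It is a sketch of the behaviour described here, not of the resolver's actual implementation; the function name is hypothetical, and target URLs that already carry a query string are deliberately not handled.

```python
def rewrite_part_identifier(handle_with_part: str, target_url: str) -> str:
    """Append the part identifier (the text after '@') to target_url as a query string.

    target_url is the location the plain handle resolves to. It is passed in
    here because this sketch performs no real resolution.
    """
    _handle, sep, part = handle_with_part.partition("@")
    if not sep or not part:
        # No part identifier given: the handle resolves to the plain target.
        return target_url
    return f"{target_url}?{part}"


if __name__ == "__main__":
    print(rewrite_part_identifier("11858/1234@test=a", "http://clarin.eu"))
```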

Please note that when you offer PIDs with part identifiers, you are also responsible for maintaining the part identification fragment. Remember that users will use it to link to your resources and that the resulting end point should always be available.

When should I use a part identifier for a PID?

(Answer taken from the ISO citer draft, p. 11)

This International Standard supports different levels of granularity. The following recommendations are designed to encourage efficiency and promote interoperability with other naming schemes:

1) If there is an existing identifier scheme for a type of resource, for instance ISBN, this level of granularity should be retained, which is to say that no new PIDs should be issued without very good reasons, such as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the identifier of the book.

2) If the resource is associated with the complete content of a digital file, an individual PID should probably be assigned for this resource.

3) If the resource is autonomous and exists outside a larger context, an individual PID should probably be assigned for this resource.

4) If a resource should be citable apart from any containing resource, an individual PID should probably be assigned for this resource.

These recommendations are, however, subject to the needs of resource creators with respect to the level of granularity they deem suitable to the specific resource environment.

Why not PURLs or URNs?

PIDs are, as said, unique and persistent identifiers of objects that are made available by proper repositories. Many resources have additional characteristics, such as multiple copies for preservation reasons, a checksum string (such as MD5) that can be used to check authenticity, simple metadata for citation purposes, a reference to the access permission record, etc. A proper system should offer such information immediately when resolving a PID. PURLs cannot offer this functionality, and for URNs we do not know of a well-proven and robust resolver, although the big libraries have agreed on using URNs for their publications.
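As an illustration of the additional characteristics mentioned above, the following Python sketch models the information a resolver could return alongside a PID and uses the stored MD5 string to check a downloaded copy. The record layout and field names are assumptions made for this example; they do not reflect the actual record format of any PID service.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PidRecord:
    """Hypothetical record attached to a PID (field names are illustrative)."""
    pid: str
    copies: List[str]            # locations of the preserved copies
    md5: str                     # hex digest used to check authenticity
    citation: Dict[str, str] = field(default_factory=dict)


def is_authentic(record: PidRecord, content: bytes) -> bool:
    """Compare the MD5 digest of the retrieved content with the stored one."""
    return hashlib.md5(content).hexdigest() == record.md5.lower()


if __name__ == "__main__":
    data = b"example resource"
    record = PidRecord(
        pid="hdl:12345/example",                      # placeholder PID
        copies=["http://example.org/copy1", "http://example.org/copy2"],
        md5=hashlib.md5(data).hexdigest(),
        citation={"title": "Example resource"},
    )
    print(is_authentic(record, data))
```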

Is CLARIN offering such a service?

CLARIN has an arrangement with the EPIC consortium under which CLARIN members will be able to register PIDs and, of course, resolve them. This consortium groups a number of reliable European service providers that want to participate in providing a redundant service for the research world, i.e. we are speaking about millions of PIDs and a service at very low cost. The service is based on the Handle System, which according to our investigations is the only robust system meeting all requirements. No one is obliged to register Handles, but CLARIN centres will of course need to demonstrate that their PIDs can be resolved in a robust manner and offer the required functionality.

What if the PID service is down?

If PIDs cannot be resolved at a certain moment, one simply cannot access a resource. Think of a situation where hundreds of users are waiting for the resolution of a PID and nothing happens - a nightmare for any cyberinfrastructure scenario! Since this would not be acceptable, we need to make sure that the PID service is based (a) on very robust and reliable software offering sufficient functionality, and (b) on a proper service run by redundant centres with a high availability and persistency guarantee.

How does it function?

Handling PIDs is very simple. First you need to register a PID for a resource or service. You do this by providing the required information to the PID service site, in particular the path used to access the resource, such as a URL. You will receive back a PID, which you can enter into the metadata description, for example, so that everyone can use it for referencing. When a user finds such a PID in a resource, they can click on the reference and the service will resolve the PID and give access to (one of the copies of) the resource. Normally, as a user, you don't see the intermediate transactions.
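The register-then-resolve cycle can be illustrated with a toy, in-memory model in Python. This is not the API of the EPIC service or of the Handle System: the class, its methods, the prefix and the suffix numbering are all invented for illustration.

```python
import itertools
from typing import Dict


class ToyPidService:
    """In-memory stand-in for a PID service (hypothetical, for illustration)."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._counter = itertools.count(1)
        self._locations: Dict[str, str] = {}

    def register(self, url: str) -> str:
        """Register a resource location and hand back a new PID."""
        pid = f"hdl:{self.prefix}/{next(self._counter):08d}"
        self._locations[pid] = url
        return pid

    def update(self, pid: str, new_url: str) -> None:
        """Point an existing PID at a new location; the PID itself stays the same."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current location for a PID."""
        return self._locations[pid]


if __name__ == "__main__":
    service = ToyPidService(prefix="12345")          # placeholder prefix
    pid = service.register("http://example.org/corpus/v1")
    service.update(pid, "http://example.org/archive/corpus/v1")
    print(pid, "->", service.resolve(pid))
```

The point of the sketch is that a reference stored in metadata keeps working after the resource moves, because only the location behind the PID changes.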

Why do I need PIDs?

In the emerging cyberinfrastructure we are creating more and more references between resources, resource fragments and services. Creating these references is very costly, and they are often essential for the interpretation of a resource. We therefore need proper mechanisms to ensure that these references survive all the changes that happen in repositories, for example. It is known that URLs are not appropriate: they are not persistent, even when we believe that they are proper URIs. This is where special PIDs come into play, which identify an object and are maintained by reliable institutions.

What is a persistent identifier (PID)?

Persistent identifiers are increasingly seen as a core component for the many references we are creating at various levels. These range from references between metadata descriptions and their resources up to references between semantic assertions made using RDF (Resource Description Framework). For more information please read the requirements specification document or the short guide.

Standards

My research involves aspects which have not yet been standardized. Can I still make use of CLARIN technology?

Very good question. A researcher will always face this issue. Research moves a field forward, and in no-man's-land there are no standards (yet): standards lag behind research. Industry always stays on firm ground, that is, on well-established conventions. Although it may appear that research cannot make use of standards, it should build on well-established foundations, which should be expressed in standardized form whenever possible. Only for the tip of the arrow, the really fresh things that have just been invented, should researchers devise their own ad hoc conventions. Applied to linguistic data, this means that an annotated corpus, for example, will contain a mixture of standard and invented markings. CLARIN can be used for the part of the processing that involves existing tools and resources that have been converted to a standard format.

I am a linguist. Do I need to have a working knowledge about all these standards?

No linguist should be required to read long documents about standards; it is primarily the task of the tool, service and converter developers to provide frameworks that help the researcher and that hide complex formalisms as much as possible.

What standards are recommended by CLARIN?

An open list follows:

  • character encoding: ISO/IEC 10646 (Unicode), UTF-8
  • country codes: ISO 3166
  • language codes: ISO 639-1 and 639-3
  • codes for the representation of names of scripts: ISO 15924
  • text format: XML
  • text format: CSV (comma-separated, with double-quote (") quoting, a header line and preferably a line of ISOcat URIs, one for each column)
  • feature structure representation: ISO 24610-1:2006
  • representation of primary sources: TEI (Text Encoding Initiative)
  • knowledge engineering: RDF, RDF-S, SKOS, OWL
  • audio/speech: PCM (Pulse Code Modulation) for digitizing sound waves, the International Phonetic Alphabet (IPA) for phonetic transcriptions
  • video/multimodality: MJPEG2000 lossless as backend format, MPEG2 or H.264 for handling and processing
  • annotation of temporal entities: TimeML (part of TC 37/SC 4)
  • morpho-syntactic annotation: MAF (Morpho-syntactic Annotation Framework), ISO/DIS 24611
  • syntactic annotation: SynAF (Syntactic Annotation Framework), ISO/CD 24615
  • lexical annotation: LMF (Lexical Markup Framework), ISO 24613:2008
  • linguistic annotation: LAF (Linguistic Annotation Framework), ISO/DIS 24612
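
As a small illustration of the CSV recommendation in the list above, the following Python sketch writes a comma-separated file with double-quote quoting, a header line and a second line holding one concept URI per column. The column names and the example.org URIs are placeholders, not real registry entries, and placing the URI line directly after the header is an assumption of this example.

```python
import csv
import io
from typing import List, Sequence


def write_annotated_csv(columns: Sequence[str],
                        concept_uris: Sequence[str],
                        rows: Sequence[Sequence[str]]) -> str:
    """Return CSV text: header line, one URI per column, then the data rows."""
    if len(columns) != len(concept_uris):
        raise ValueError("need exactly one concept URI per column")
    buffer = io.StringIO()
    writer = csv.writer(buffer, quotechar='"', quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    writer.writerow(columns)
    writer.writerow(concept_uris)
    writer.writerows(rows)
    return buffer.getvalue()


def read_annotated_csv(text: str) -> List[List[str]]:
    """Parse the CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), quotechar='"'))


if __name__ == "__main__":
    text = write_annotated_csv(
        ["token", "pos"],
        ["http://example.org/concept/token", "http://example.org/concept/pos"],
        [["dog", "noun"], ["barks", "verb"]],
    )
    print(text, end="")
```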

Is there a standardization action plan in CLARIN?

CLARIN actively tracks a number of ongoing standardisation activities at two major levels: linguistic structures/formats and linguistic encoding. As an infrastructure project, CLARIN has the duty to evaluate, test and comment on these proposals in close cooperation with the relevant standardisation bodies. When necessary, CLARIN may take the lead in initiating new standardisation activities where a clear gap in coverage is identified. For more information see the CLARIN Standardization Action Plan.

Why do we need standards in CLARIN?

CLARIN does not create linguistic resources; its purpose is to offer rapid access to existing resources and to facilitate their reuse in new contexts. When resources and tools are produced for individual use, interoperability, and with it the need to adhere to standards or best practices, is of little relevance. The problem of interoperability only emerges when linguists are ready to offer their resources and tools to other researchers. One of the requirements of interoperability is being able to connect different resources to the same tool. This can be achieved using standards, but it would imply having all resources standardized (an ideal situation that cannot always be achieved in reality). When needed, a standard can also play the role of a pivot format (resources are converted to the standard before they are used).
