Frequently Asked Questions

Why are some (of my) resources marked as unavailable or having restricted access in the VLO?

All resources linked to from records that appear in the VLO are checked for accessibility on a regular basis as part of the processing by the CLARIN Curation Module.

The link checking information is read by the VLO when importing resources and is regularly updated. The VLO discards any information that is too old (more than 100 days). Resources that were found to be unavailable or to require special permissions are marked with an icon: a warning sign for unavailability or a lock for restricted access. This classification is based on the HTTP response code encountered (for example, 404 or 503 is classified as unavailable, 401 or 403 as restricted). These icons are shown in the VLO search results for records with one or more problematic linked resources. The resources detail panel of each record page shows a status per link. Each link can be expanded to reveal details of the last check: status code and time of checking.
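The classification described above can be sketched roughly as follows. This is only an illustration, not the actual code of the Curation Module or the VLO: the function name, the return labels, and the handling of codes other than the four named in the text (2xx and 3xx treated as "ok", everything else left "unclassified") are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=100)   # link-check information older than this is discarded

UNAVAILABLE_CODES = {404, 503}  # examples named in the text
RESTRICTED_CODES = {401, 403}   # examples named in the text


def classify_link(status_code: int, checked_at: datetime, now: datetime) -> Optional[str]:
    """Return a status label, or None when the check result is too old to use."""
    if now - checked_at > MAX_AGE:
        return None
    if status_code in RESTRICTED_CODES:
        return "restricted"
    if status_code in UNAVAILABLE_CODES:
        return "unavailable"
    if 200 <= status_code < 400:
        return "ok"            # assumption: success and redirect codes count as available
    return "unclassified"      # assumption: the text does not say how other codes are treated
```

A record would then get a warning or lock icon in the search results if any of its links is classified as unavailable or restricted.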

Note that the link checking information may be outdated at the time it is displayed in the VLO. The link checker can only check the status of each resource location with a certain frequency, and the availability status may change.

If a resource found in the VLO is indeed unavailable or has restricted access and you would like to access it, contact its provider via the landing page (if available) or look for contact information in the 'All metadata' tab.

If you are a resource provider and believe that the accessibility status shown for your resources is incorrect, please contact the VLO developers (vlo [at] clarin.eu).

How often is the content of the metadata curation updated?

The CLARIN Curation Module analyses the metadata harvested for inclusion in the VLO. It creates reports for metadata collections on a fixed schedule: collection reports are generated on Tuesday and Saturday, beginning at 7 p.m. CET. At the end of the procedure, which takes about 2 hours, the Curation Module website is reinitialised. This takes a few minutes to complete, after which the latest reports are available. The date and time of analysis are included in the collection report.

How is the order in which records are displayed in the VLO determined?

The result ranking is based on relevance to the query (if applicable) and a number of general record properties. This is further explained in the section Understanding Search Results of the VLO's help page.

Why do some of my records not appear in the VLO?

If some or most of your records have been harvested and imported into the VLO without problems, but some records seem to have been omitted, they were probably explicitly skipped by the VLO importer process. Records with no (valid) resource proxies are omitted from the import process. Make sure to always include at least one resource proxy, for example a landing page or a link to a resource (see "How does the Resources section work?"). The VLO import process also skips records that are too large. The limit is currently 50 megabytes. If you have larger metadata records that you think should be included, please contact us at cmdi [at] clarin.eu.

Note that metadata files that are not in CMDI version 1.2 get converted before import. The limit applies to the converted file, so certain records may get skipped even if the original file is within the size limit. Depending on the formatting, a converted record should not be more than twice the size of the original file; typically the difference is 20% or less.
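As a rough pre-flight check before offering records for harvesting, the two skip conditions above can be approximated as follows. This is a sketch under assumptions, not the importer's actual logic: "50 megabytes" is read as 50 * 1024 * 1024 bytes, a proxy counts as valid when it has a non-empty ResourceRef, and the function name is hypothetical. Keep in mind that the limit applies to the converted file, which this check does not produce.

```python
import xml.etree.ElementTree as ET
from typing import List

MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 megabytes" interpreted as 50 MiB


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def likely_skip_reasons(cmdi_xml: bytes) -> List[str]:
    """Return reasons why a record would probably be skipped (empty list if none)."""
    reasons = []
    if len(cmdi_xml) > MAX_BYTES:
        reasons.append("record exceeds the size limit")
    root = ET.fromstring(cmdi_xml)
    # Assumption: a usable proxy is one with a non-empty ResourceRef element.
    refs = [
        el for el in root.iter()
        if local_name(el.tag) == "ResourceRef" and (el.text or "").strip()
    ]
    if not refs:
        reasons.append("no resource proxy with a reference")
    return reasons
```

Matching on local element names keeps the check independent of the CMDI version's namespace.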

If files seem to be omitted from hierarchies, but do appear in the VLO in isolation, the most likely reason is that the Resource Proxy reference value (a URL or PID) used in the parent record does not match the self link value (in the MdSelfLink header item) of the referenced record. The VLO will only consider a record to be a parent of another record if it uses the exact self link of the latter to link to it.
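To find such mismatches in your own collection, you could compare the references in a parent record against the self links of its intended children. The sketch below assumes that child records are linked through resource proxies with ResourceType "Metadata" and uses plain string equality, in line with the exact-match rule above; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Optional


def _local(tag: str) -> str:
    """Drop the '{namespace}' part of an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def self_link(cmdi_xml: str) -> Optional[str]:
    """Return the MdSelfLink value of a record, or None if it is absent or empty."""
    for el in ET.fromstring(cmdi_xml).iter():
        if _local(el.tag) == "MdSelfLink":
            return (el.text or "").strip() or None
    return None


def metadata_refs(cmdi_xml: str) -> List[str]:
    """Return ResourceRef values of proxies whose ResourceType is 'Metadata'."""
    refs = []
    for proxy in ET.fromstring(cmdi_xml).iter():
        if _local(proxy.tag) != "ResourceProxy":
            continue
        fields = {_local(child.tag): (child.text or "").strip() for child in proxy}
        if fields.get("ResourceType") == "Metadata" and fields.get("ResourceRef"):
            refs.append(fields["ResourceRef"])
    return refs


def unmatched_refs(parent_xml: str, children_xml: Iterable[str]) -> List[str]:
    """Return parent references that equal no child's self link exactly."""
    links = {self_link(child) for child in children_xml}
    return [ref for ref in metadata_refs(parent_xml) if ref not in links]
```

Any reference returned by `unmatched_refs` points at a record that would show up in isolation rather than as part of the hierarchy; a trailing slash or a different URL scheme is enough to cause this.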

What is the update schedule for the metadata in the VLO?

The metadata harvester runs with different configurations at different starting times:

Monday and Thursday 20:00 CET/CEST: harvesting of CLARIN providers
Friday 20:00 CET/CEST: harvesting of non-CLARIN providers

The importer for the production VLO runs after every completed harvest (importing all current metadata, both CLARIN and non-CLARIN).

Combined, harvester/import runs typically take about 12-20 hours to complete, depending primarily on the response rate of the metadata providers.

Because of this schedule, it can take a couple of days before provided metadata becomes available in the VLO. If it takes longer than that, please send a message to vlw [at] clarin.eu.

Be aware that the harvester and VLO schedule may be subject to change in the future!

Can I test-drive the VLO import for my CMDI metadata?

Yes. If you contact cmdi [at] clarin.eu, we can try to import your newly created metadata into the VLO Alpha instance, so that you get an idea of how well the mapping to the VLO facets works.

Can I harvest myself? Where is the source code of the harvester?

You can use the harvester yourself. Its source code is available at GitHub.

By the way, if you only need access to the harvested files, you can also download these as a tarball.

I have a self-developed repository. How can I offer my CMDI files over OAI-PMH?

There are several packages available to set up an OAI provider. Some popular examples:

- file-based, Tomcat web application written in Java: jOAI (some centres have good experiences with this one)
- database-based, written in PHP: oai-pmh-2
- library to connect to a Java application: OAIProvider

More tools to set up an OAI data provider can be found at the OAI webpage.

I have a DSpace repository. How can I offer my CMDI files over OAI-PMH?

DSpace comes with a built-in OAI provider, which can be used for this purpose.

Which sources are harvested for the VLO?

- The OAI providers from the CLARIN centre registry (= the CLARIN B and C centres)
- A (relatively static) list of additional OAI providers (OLAC and CMDI)

More (technical) details about the harvesting process and results can be found in the OAI Harvest Viewer.

I have multiple OAI prefixes. How does the harvester deal with this?

The harvester uses the namespace URL to detect the provided metadata format (http://www.clarin.eu/cmd/ for CMDI), so all prefixes get included automatically. If you nevertheless encounter unexpected behaviour, please contact harvester [at] clarin.eu.
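To illustrate the idea of namespace-based detection, the sketch below parses an OAI-PMH ListMetadataFormats response and selects every metadataPrefix whose metadataNamespace begins with the CMDI namespace URL. It is not the harvester's own code: the function name is hypothetical, and matching by "starts with" rather than exact equality is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
CMDI_NAMESPACE = "http://www.clarin.eu/cmd/"  # namespace URL named in the text


def cmdi_prefixes(list_metadata_formats_xml: str) -> List[str]:
    """Return all metadata prefixes that are declared with a CMDI namespace."""
    root = ET.fromstring(list_metadata_formats_xml)
    prefixes = []
    for fmt in root.iter(OAI_NS + "metadataFormat"):
        namespace = (fmt.findtext(OAI_NS + "metadataNamespace", default="") or "").strip()
        prefix = (fmt.findtext(OAI_NS + "metadataPrefix", default="") or "").strip()
        # Assumption: a prefix match on the namespace URL identifies CMDI.
        if prefix and namespace.startswith(CMDI_NAMESPACE):
            prefixes.append(prefix)
    return prefixes
```

With this approach the prefix names themselves do not matter: an endpoint offering both `cmdi` and, say, `cmdi_lexicon` would have both selected as long as each declares a CMDI namespace.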

How does the mapping to the VLO facets work?

The VLO uses concepts from the profiles (and in some cases XPaths as a fallback). A detailed overview is available at:

https://vlo.clarin.eu/mapping

More information about this topic can be found in this paper

I have a Fedora repository. How can I offer my CMDI files over OAI-PMH?

The CLARIN-D team in Leipzig has written an excellent guide (in German) explaining how to do this.

If I am hosting a hierarchical CMDI metadata collection, do I need to offer all records over OAI-PMH or only the root nodes?

Short answer: as indicated in the protocol, you need to offer all records. There is no automatic harvesting of any of the child nodes.

Long answer: in the case of our toy example hierarchy (http://www.clarin.eu/faq/3454), you would need to provide the following files over OAI-PMH:

http://www.clarin.eu/cmd/example/collection/collection_root.cmdi

http://www.clarin.eu/cmd/example/collection/collection_olac.cmdi

http://www.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi

http://www.clarin.eu/cmd/example/collection/lrt/lrt-1001.cmdi

http://www.clarin.eu/cmd/example/collection/lrt/lrt-1002.cmdi

http://www.clarin.eu/cmd/example/collection/lrt/lrt-1003.cmdi

http://www.clarin.eu/cmd/example/collection/lrt/lrt-1004.cmdi

http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu_Biling_DeHouwer.cmdi

http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu_Biling_Deuchar.cmdi

http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu_Biling_FerFuLice.cmdi

http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu_Biling_Genesee.cmdi

Providing collection_root.cmdi (or even collection_olac.cmdi and collection_lrt_inventory.cmdi) is not enough, as all OAI harvesters are protocol-agnostic and thus do not know about CMDI’s hierarchy building! CMDI-consuming applications, such as the VLO, also need the physical files locally.

How can I provide CMDI metadata over OAI-PMH?

See http://www.clarin.eu/node/3014

What is this metadata harvesting thing? Are you growing potatoes?

No, we are not. It is a term used for gathering metadata descriptions from several locations and storing them in a central database. You can find the results of such a harvesting process at https://vlo.clarin.eu

More information about harvesting metadata can be found at http://en.wikipedia.org/wiki/Open_Archives_Initiative_Protocol_for_Meta…
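In OAI-PMH terms, gathering records from one location boils down to repeated ListRecords requests, each continuing from the resumption token returned by the previous response. The sketch below only builds request URLs and extracts the token from a response; it performs no network access, and the endpoint URL in the usage note and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, metadata_prefix: str,
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL for the first or a follow-up page."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token of a response, or None when the list is complete."""
    element = ET.fromstring(response_xml).find(".//" + OAI_NS + "resumptionToken")
    if element is None:
        return None
    return (element.text or "").strip() or None
```

A harvesting loop would fetch `list_records_url("https://example.org/oai", "cmdi")`, store the returned records, and keep requesting follow-up pages until `next_token` returns None.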

How can I publish my metadata to the Virtual Language Observatory?

If you have many metadata records or records that frequently change:

- Use OAI-PMH
- Provide them preferably as CMDI (see "How can I provide CMDI metadata over OAI-PMH?" for details)
- If that is not possible, provide them as OLAC
- Depending on the situation, your endpoint has to be added to the Centre Registry (in case of a registered CLARIN centre) or added manually to our harvester
  - In case of a CLARIN centre
    - if your centre has already been registered in the Centre Registry, contact the Centre Registry administration via e-mail. Make sure to provide the name of your centre and the complete URL of the endpoint.
    - if your centre has not yet been registered in the Centre Registry, fill in the Centre Registry registration form.
  - Otherwise
    - Send a mail to harvester [at] clarin.eu to notify us about your OAI-PMH access point

If you have only a few static records and setting up an OAI-PMH access point is not feasible:

Submit your records to the Language Resource Inventory (for data, corpora, lexica, web services, software, ...)
All the records in the inventory will be automatically converted into CMDI records. Note that this process can take a while.

Data provided over OAI-PMH or via the LRT inventory will be made available and searchable via the Virtual Language Observatory.

If I deliver CLARIN metadata today, can you make my metadata available to the broad public?

Yes - bringing all metadata descriptions together ("harvesting") and making them searchable ("indexing") is an important part of the infrastructure that CLARIN is building. When you provide metadata, CLARIN can harvest it and make it available via the Virtual Language Observatory.

