Frequently Asked Questions - Metadata in CLARIN

Authoring and editing CMDI

What software is available for authoring CMDI metadata?

There are several options for creating records "by hand" (as opposed to having CMDI generated from a primary source by a script, repository system or other software):

  • An XML editor such as oXygen or a text editor or code editor with XML support, preferably with XML Schema awareness.
    • In addition to oXygen, which has a commercial licence, there are also numerous free XML editors. See for example this comparison of XML editors.
  • COMEDI is an online CMDI editor developed within CLARINO that works with any profile.
  • Arbil, the metadata editor developed at the Max Planck Institute for Psycholinguistics, supports any CMDI profile and provides powerful table-based editing. No longer maintained!

Dedicated editors with a limited number of supported profiles:

  • CMDI Maker from the University of Cologne
  • ProFormA, part of the NaLiDa project at the University of Tübingen. No longer available!
What are the guidelines for creating good quality metadata?
Is Arbil still maintained or supported?

Arbil is no longer maintained or supported by its creators or CLARIN. See the tool page for detailed information regarding support and maintenance.

What version(s) of CMDI does Arbil support?

Arbil supports CMDI 1.1. Metadata records that are based on CMDI 1.2 cannot be opened or created successfully in any current version of Arbil. In addition to CMDI 1.1, Arbil also supports IMDI. Also note that Arbil is no longer being maintained and therefore cannot be supported by CLARIN.

What are the grey club icons that appear in Arbil's tree view? What does the number after those elements mean?

When a component can occur multiple times (= CardinalityMax higher than 1), Arbil automatically groups all occurrences of these components in the file. You can recognize these by the following properties:

  • they have the grey club icon
  • the text is shown in grey
  • after the node a number indicates how many times the component occurs

E.g. in this example CMDI file there is a fragment that looks like:



What is the meaning of the icons used in Arbil?
How can I see and edit all elements in a CMDI file at once in Arbil?

Right click on the file in the "local corpus" panel and select Edit all Metadata.

When creating a new CMDI file in Arbil, not all elements (fields) are showing up.

Correct observation. The elements that are optional (= have a CardinalityMin of 0) are not shown by default. You need to add them explicitly. To do this, right click on the file in the "local corpus" panel and select Add.

I have a question about Arbil that is not answered here. Where can I get support?

Support information can be found on the Arbil tool page. Please note that Arbil is no longer under development and support is currently limited to the forum.

How do I create a new CMDI metadata file?
Using oXygen

In the oXygen XML editor, an easy way to get started is using the "Generate sample XML files" feature:

  • Choose "Generate sample XML files" in the "Tools" menu.
  • Insert the URL of the schema of the desired profile (see "How can I create an XSD (XML schema) from my CMDI profile?") into the URL field.
  • Make sure to select CMD as the root element (possibly overriding Oxygen's default suggestion). This should set the value of the "namespace" field to the CMDI namespace of the corresponding version (CMDI 1.1, or CMDI 1.2 or higher).
  • Optionally set the desired default namespace (if you don't know what this does, you can skip this step).
  • Review the options in the "Options" pane, in particular the checkboxes that determine whether optional fields should be instantiated or not.
  • Click "OK" to generate a sample CMDI document.
  • Use this document as a template to create your CMDI records.
  • Use the autocomplete and suggestion functionality of Oxygen's Text or Grid mode to further edit your document(s).
  • Make sure to validate your document regularly, in particular before publishing your metadata.
  • The official documentation provides more information about editing XML documents using oXygen.
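As a quick sanity check on a generated sample document, the root element and its namespace can be inspected with a few lines of Python. This is only a minimal sketch using the standard library: it checks well-formedness and the root element name, and is no substitute for validating against the profile's XSD. The namespace in the sample is a made-up placeholder, not a real CMDI namespace.

```python
import xml.etree.ElementTree as ET


def root_info(xml_text):
    """Return (namespace, local_name) of the root element of an XML document."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    tag = root.tag
    if tag.startswith("{"):
        # ElementTree writes qualified names as "{namespace}local_name"
        namespace, local_name = tag[1:].split("}", 1)
        return namespace, local_name
    return "", tag


# Hypothetical sample; the namespace URI is a placeholder for illustration only
sample = '<CMD xmlns="http://example.org/hypothetical-cmd-namespace"><Header/></CMD>'
namespace, local_name = root_info(sample)
if local_name != "CMD":
    print("Unexpected root element:", local_name)
```

If the root element is not CMD, the wrong root was probably selected when generating the sample.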
Using Arbil

Arbil only supports CMDI 1.1 and should not be used with new or existing CMDI 1.2 records or profiles.

Note that Arbil is no longer being maintained and therefore cannot be supported by CLARIN.

  • Download Arbil (2.6 or higher) and start it
  • Go to Options > Templates & Profiles
  • Select in "Clarin Profiles" which profile(s) you want to use as the basis for a CMDI file and click on Close
  • Right-click on Local corpus, choose Add and select the relevant profile (the CMDI profiles are marked with a CLARIN icon)
Why can't I see all CMDI profiles from the component registry in Arbil?

Some profiles (obvious tests and the ones not intended for manual metadata creation) have been excluded from the default profile list in Arbil. You can see them by disabling the "only load profiles selected for manual editing" option in the Available Templates & Profiles dialog.

By default, if you create a new profile in the component registry, it will show up in Arbil.

How can I add a reference to an external file (a ResourceProxy) in Arbil?

As of Arbil 2.3, there is an item 'Insert Manual Resource Location' in the context menu of instances in the tree.

Which version of Arbil should I use to edit CMDI files?

Arbil is no longer maintained, therefore CLARIN cannot support its usage. CMDI 1.1 is supported in the last stable version, Arbil 2.6. It can be obtained via the Arbil download page at The Language Archive. Arbil cannot be used for editing or creating CMDI 1.2 records.

Is Arbil a CMDI editor?

Arbil is indeed a metadata editor with support for CMDI files.

It used to be an editor for IMDI only; the CMDI functionality was added later on (since the beginning of 2010). This means that the support for CMDI files was not as extensive as the one for IMDI. However, since release 2.3 of Arbil the support for CMDI has been significantly improved.

Note that Arbil is currently no longer maintained and therefore cannot be supported by CLARIN.

Should I switch to CMDI 1.2?
What is CMDI 1.2 and how does it affect me?

CMDI 1.2 is the successor to the CMDI 1.1 metadata framework and is one of the two currently supported versions of CMDI. More information about this specific version can be found at the CMDI 1.2 page. How the introduction of CMDI 1.2 affects you depends on your role within CLARIN. Click one of the following links to find detailed information about the transition to CMDI 1.2 that is relevant to you:

What version of CMDI should I use?

What versions of CMDI are there and which should I use?

There are currently two supported versions of CLARIN's component metadata framework: CMDI 1.1 and CMDI 1.2. The former has been in active use for many years and is widely supported within the CLARIN infrastructure. CMDI 1.2 was introduced in 2016 and provides a number of new features and improvements compared to its predecessor. However, its support throughout the infrastructure is still limited (at the time of writing this FAQ, July 2016).

Therefore in order to make a decision about which version of CMDI to use, it's advised to first determine which tools you need your metadata to be processed with. More details about CMDI 1.2, including current information with respect to its support throughout the infrastructure, can be found at the CMDI 1.2 page.

I found a profile that almost matches my needs. Can I add some fields?

The fields of a profile are fixed, so you will need to use a different profile. Don't worry, you can create your own. Since you found a profile that seems to almost match your needs, the most logical thing to do is to create a new profile based on that one.

You can do this yourself, as long as you have a way to log in to the Component Registry. Click the 'login' link and select your home institute or another provider you have an account with. If none is in the list, create an account with CLARIN (more info).

When logged in, select the base profile and click the 'Edit as new' button. Save it in your private workspace (under a different name and/or group). The profile consists of links to a number of components (some of which in turn consist partially of links to components), so you will have to identify the components that you need to change. Edit these 'as new' as well, and make the required changes. You may have to do this recursively for deeper hierarchies. Then, in your profile, replace the references to the original components with references to your new versions of these components. Save the profile, and test it in an editor (e.g. oXygen or Arbil) before publishing. You can get the XSD link by selecting the profile in the component browser and choosing 'Show Info' from the drop-down menu on the far right. You can open this link in an XML editor or validator; in Arbil you can add it via the 'Profiles and templates' settings.

How can I add a link to the original repository where a resource is hosted? (landing page)

If you want to add a link to the original context of the metadata file, e.g. to the repository where it is hosted (example), add a ResourceProxy of the type LandingPage, e.g.:

 <ResourceProxy id="lp">
How can I indicate that the resources described with a CMDI file are also searchable via a specialised web application?
How can I create an XSD (XML schema) from my CMDI profile?
How can I indicate that the resources described with a CMDI file are also searchable via SRU/CQL?

This can be done with a ResourceProxy where:

  • ResourceType = SearchService
  • mimetype = application/sru+xml


 <ResourceProxy id="d55">
   <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>

For a complete example file see:

Is there a list of recommended components and profiles?

As a starting point, see the list below. We are working to extend it.

Components with controlled vocabularies Other components


Can I use multiple languages in my metadata description?

Yes. If you tick the checkmark next to Multilingual for an element in the Component Registry, it will result in a multilingual field. With the xml:lang attribute you can then indicate the language in which an element has been described, see e.g. the following fragment in this example CMDI file:

<!-- Note the support for multilingual fields, using the xml:lang attribute -->
<title xml:lang="eng">mister</title>
<title xml:lang="fra">monsieur</title>
<title xml:lang="nld">mijnheer</title>

For indicating the language we strongly advise using the ISO 639-3 language code.

Please note that enabling Multilingual will make the element repeatable, even if the Maximum number of occurrences is set to 1.
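A consumer of such a record can pick the value for a given language by reading the xml:lang attribute. The following is a minimal sketch using Python's standard library; the wrapping `example` element is hypothetical and the fragment is not namespaced, whereas elements in a real CMDI record are.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed, predefined XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def titles_by_language(xml_text):
    """Map each xml:lang code to the text of the corresponding title element."""
    root = ET.fromstring(xml_text)
    return {el.get(XML_LANG): el.text for el in root.iter("title")}


# The wrapping <example> element is hypothetical; the titles come from the fragment above
fragment = """<example>
  <title xml:lang="eng">mister</title>
  <title xml:lang="fra">monsieur</title>
  <title xml:lang="nld">mijnheer</title>
</example>"""

titles = titles_by_language(fragment)
print(titles.get("nld"))
```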

What is the difference between a component and a profile?

Technically there is no real difference. A profile is a component that can be converted into an XSD file. A normal component can only be used within other components or profiles and can never be transformed into an XSD.

The isProfile="true" attribute indicates that a CMD_ComponentSpec defines a profile and not just a component.
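As an illustration, the attribute can be checked programmatically. This is a small sketch that assumes the specification's root element carries the isProfile attribute directly, as described above; the sample documents are stripped-down placeholders, not complete specifications.

```python
import xml.etree.ElementTree as ET


def is_profile(spec_xml):
    """Return True if the specification's root element has isProfile="true"."""
    root = ET.fromstring(spec_xml)
    return root.get("isProfile") == "true"


# Stripped-down placeholder specifications, for illustration only
profile_spec = '<CMD_ComponentSpec isProfile="true"/>'
component_spec = '<CMD_ComponentSpec isProfile="false"/>'

print(is_profile(profile_spec), is_profile(component_spec))
```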

How do I know on which profile a CMDI file is based?

The MdProfile element (in the Header section) contains a unique profile code. Alternatively, you can also find the profile identifier as part of the schema location, for example (CMDI 1.1):

<CMD ... xsi:schemaLocation="">

or (CMDI 1.2):

<cmd:CMD ... xsi:schemaLocation="">

You can find the profile in the component registry with the following URL:


How can I specify additional details about a ResourceProxy?

The information that a ResourceProxy can contain (a URL and mimetype) is kept very minimal, on purpose. However you can use any component to add more details about such a ResourceProxy, using the id attribute.

E.g. in the example CMDI file we can add a textual description of the photo. First the relevant ResourceProxy gets the id "a_photo":

<ResourceProxy id="a_photo">
    <ResourceType mimetype="image/jpeg">Resource</ResourceType>
    <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->

Then, later on in the same CMDI file, we have an explanatory component example-component-photo with a description element:

<example-component-photo ref="a_photo">
     <description>a suitable textual description of this photo</description>

Thanks to the reference from this component to the ResourceProxy with the ref attribute we know that the description relates to the photo.

Note that the id attribute should be unique for each ResourceProxy.
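The two rules above (unique ids, and ref values that point to an existing ResourceProxy) can be checked mechanically. The sketch below is an illustration only: it assumes an unqualified ref attribute holding a single id, as in the fragment above, and it ignores namespaces. The `example-component-text` element and the `missing_id` value are made up for the demonstration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def check_references(cmdi_xml):
    """Return (duplicate ResourceProxy ids, ref values without a matching id)."""
    root = ET.fromstring(cmdi_xml)
    ids = [el.get("id") for el in root.iter()
           if local_name(el.tag) == "ResourceProxy" and el.get("id")]
    duplicates = sorted(i for i, count in Counter(ids).items() if count > 1)
    known = set(ids)
    dangling = sorted({el.get("ref") for el in root.iter()
                       if el.get("ref") is not None and el.get("ref") not in known})
    return duplicates, dangling


# Simplified, non-namespaced record; the second component is hypothetical
sample = """<CMD>
  <Resources><ResourceProxyList>
    <ResourceProxy id="a_photo"/>
    <ResourceProxy id="a_text"/>
  </ResourceProxyList></Resources>
  <Components>
    <example-component-photo ref="a_photo">
      <description>a suitable textual description of this photo</description>
    </example-component-photo>
    <example-component-text ref="missing_id"/>
  </Components>
</CMD>"""

duplicates, dangling = check_references(sample)
print(duplicates, dangling)
```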

How do I point to the files I'm describing with CMDI? How does the Resources section work?

OK, so how can you refer to an external file from a metadata description? That is what the Resources section is for.

In the example CMDI file, the resources section looks like:

      <!-- List of external resource files and (CMDI) metadata files -->
         <ResourceProxy id="a_photo">
            <ResourceType mimetype="image/jpeg">Resource</ResourceType>
            <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
         <ResourceProxy id="a_text">
            <ResourceType mimetype="text/plain">Resource</ResourceType>


As you can see, for each link to an external resource a ResourceProxy (= file) is added to the ResourceProxyList (= file list). For each ResourceProxy you need to specify the ResourceType:

  • Resource, the default, for a link to a web-accessible file (e.g. text file, MPEG video, file)
  • Metadata in case you want to build a hierarchy of CMDI files
  • SearchPage, to link to a specialised website where the described resource can be queried (more details...)
  • LandingPage, to link to the "original context", e.g. the URL of a repository system displaying the digital object that is described (more details...)
  • SearchService, to link to a specialised webservice where the described resource can be queried (more details...)

With an optional (but very useful) mimetype attribute you can (surprise!) indicate the file's mime type. The ResourceRef contains either a normal URL or a handle PID.
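For records generated by a script, a ResourceProxy can be assembled along these lines. This is a minimal sketch, not an official API: the function name is made up, the elements are created without a namespace (real records use the CMDI namespace of the relevant version), and the URL is a placeholder.

```python
import xml.etree.ElementTree as ET

# The five resource types listed above
RESOURCE_TYPES = {"Resource", "Metadata", "SearchPage", "LandingPage", "SearchService"}


def build_resource_proxy(proxy_id, resource_type, ref, mimetype=None):
    """Build a (non-namespaced) ResourceProxy element for illustration."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError("unknown ResourceType: " + resource_type)
    proxy = ET.Element("ResourceProxy", {"id": proxy_id})
    # The mimetype attribute is optional
    attributes = {"mimetype": mimetype} if mimetype else {}
    type_element = ET.SubElement(proxy, "ResourceType", attributes)
    type_element.text = resource_type
    ref_element = ET.SubElement(proxy, "ResourceRef")
    ref_element.text = ref  # a normal URL or a handle PID
    return proxy


# Placeholder URL, for illustration only
landing_page = build_resource_proxy(
    "lp", "LandingPage", "https://example.org/hypothetical/landing-page")
xml_text = ET.tostring(landing_page, encoding="unicode")
print(xml_text)
```

Rejecting unknown resource types early keeps typos such as "Landingpage" out of generated records.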

What parts does a CMDI metadata file have?

Each file consists of 3 parts:

  • a (fixed) Header, containing administrative information:
    • MdCreator: the author of the file
      • e.g. "Eric Carlson"
    • MdCreationDate: the creation date of this file
      • e.g. "2016-12-31"
    • MdSelfLink: the URL or PID of this file
    • MdProfile: the unique identifier of a CMDI profile, as generated by the component registry
      • e.g. ""
    • MdCollectionDisplayName: an (optional but recommended) plain text indication of the collection to which this file belongs. Used for the Collection facet in the VLO
  • a (fixed) Resources section, containing links to:
    • external files (e.g. an annotation file or a sound recording)
    • and/or other CMDI metadata files (to build hierarchies)
  • a (flexible) Components section, where the actual components that this profile contains will appear

This example CMDI file illustrates the use of the 3 parts.
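To illustrate the Header part, the sketch below reads the administrative fields of a record into a dictionary. It is a simplified illustration: namespaces are ignored, and the self link, profile identifier and collection name in the sample are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_header(cmdi_xml):
    """Return the Header fields of a CMDI record as a dict of name -> text."""
    root = ET.fromstring(cmdi_xml)
    for section in root:
        if local_name(section.tag) == "Header":
            return {local_name(field.tag): (field.text or "").strip()
                    for field in section}
    return {}


# Simplified record; values other than the creator and date are placeholders
sample = """<CMD>
  <Header>
    <MdCreator>Eric Carlson</MdCreator>
    <MdCreationDate>2016-12-31</MdCreationDate>
    <MdSelfLink>https://example.org/hypothetical/record.cmdi</MdSelfLink>
    <MdProfile>hypothetical-profile-id</MdProfile>
    <MdCollectionDisplayName>Hypothetical collection</MdCollectionDisplayName>
  </Header>
  <Resources/>
  <Components/>
</CMD>"""

header = read_header(sample)
print(header["MdCreator"], header["MdProfile"])
```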

Is there a CMDI profile I can use to describe web services?

There are multiple suitable profiles, as described in the CMDI core model for web services (and extended documentation).

See also the following paper:

Windhouwer, M., Broeder, D., & Van Uytvanck, D. (2012). A CMD core model for CLARIN web services. In Proceedings of the workshop on Describing Language Resources with Metadata: Towards Flexibility and Interoperability in the Documentation of Language Resources at LREC 2012 (pp. 41-48).

How can I create a hierarchical collection with CMDI?

Link from the parent .cmdi file to the child .cmdi file with a ResourceProxy that has the ResourceType Metadata.

The recommended profile to use for collection description is ("Collection").

All files of this example collection can be accessed and explored via

Notice that this example is based on CMDI 1.1. The same principles apply to CMDI 1.2, in which hierarchies can be constructed in the same manner.

Below is a graphical representation (as shown by Arbil) of the CMDI file hierarchy used above as an example.

OK, my CMDI or OLAC metadata (describing linguistic resources) is ready, how to proceed now?
Where can I find all details on, references to and the background of this component-based metadata concept?

Check out this specification document on metadata.

PLEASE NOTE: The information in this document might be partially outdated. The information on this website (including these FAQs) is certainly more up to date and should be considered authoritative.

If there is no single metadata scheme, how should I describe my resources in order for them to be compatible with the CLARIN infrastructure?

CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme that suits your particular needs. Of course you can share your profile with others (in fact we strongly advise that). If sharing the full profile is not an option, you still can use common components, e.g. a component to describe a sound recording. In case that still does not address your needs, it is even possible to create components yourself.

So what metadata scheme is used within CLARIN?

Good question. In fact, there is no such thing as a single CLARIN metadata scheme. Practice showed that using a single scheme for a large community (e.g. the Humanities) often results in a mismatch between the chosen elements and the needs of the user.

What metadata schemes are there for the description of linguistic resources?

Quite a few. Examples are: Dublin Core, OLAC (which is an enriched version of Dublin Core), IMDI, the TEI header.

What is a metadata scheme?

A fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.

What is metadata?

Metadata is data about data: information describing properties of linguistic resources, for instance the size of a corpus, the recording date of a speech file, the purpose for which annotations were created.

Conversion to CMDI

How can I convert my IMDI records to CMDI?

If you have old records in the IMDI format you can use the following profiles:

From the profile you can generate the XSD:

And then you can transform your IMDI files into files that comply with the profile with the following set of XSLTs:…



How can I convert my EDM records to CMDI?
How can I convert my PARADISEC records to CMDI?
How can I convert my MODS records to CMDI?
How can I convert my Meta-Share records to CMDI?

If you have records in the Meta-Share maximal format you can use the profiles and conversion stylesheets as described at the Meta site.

If you have records in the Meta-Share minimal format you can use this profile (and the generated XSD).

Then you can use an XSLT transformation to transform your Meta-Share records into the CMDI equivalent. For the maximal Meta-Share schema, guidelines and XSLT files are provided here. For the minimal Meta-Share schema to CMDI, the XSLT is provided here.

Related to this, Jozef Misutka from UFAL has kindly implemented an OAI-PMH module for the Meta-Share repository.

See also: CMDI interoperability workshop


How can I convert my TEI headers to CMDI?

There is no general procedure to do this, as TEI has many variants and extensions. However, you could follow this general workflow:

  • Inspect your TEI headers and decide what the relevant parts are. Some information (e.g. layout tags etc.) might be lost during the conversion.
  • Compare your needs with one of the existing TEI profiles (teiHeader type 1, teiHeader type 2, teiHeader type 3) in the component registry. If it fulfills your needs, go to the next steps. If it does not, use the TEI profile as a basis to create your own CMDI profile.
  • Create an XSLT that generates CMDI instances (according to the profile that you chose in the previous step) from the TEI files. (Have a look at olac2cmdi.xsl and imdi2clarin.xsl for some inspiration).
How can I convert my DC or OLAC records to CMDI?

If you have old records in DC (or OLAC, a linguistic extension of DC) you can use the following profile:

From that profile you can generate the XSD:

And then you can transform your DC XML files into files that comply with the profile with the following XSLT:

An example (DC) input file:

The corresponding (CMDI) output file:

Harvesting and VLO

Why are some (of my) resources marked as unavailable or having restricted access in the VLO?

All resources linked to from records that appear in the VLO are checked for accessibility on a regular basis as part of the processing by the CLARIN Curation Module.

The link checking information is read by the VLO when importing resources and regularly updated. The VLO discards any information that is too old (more than 100 days). Resources that were found to either be unavailable or require special permissions to access are marked with an icon to indicate either unavailability (warning sign) or restricted access (lock icon). This classification happens on the basis of the encountered HTTP response code (for example 404 or 503 is classified as unavailable, 401 or 403 as restricted). These icons are shown in the VLO search results for records with one or more problematic linked resources. The resources detail panel of record pages shows a status per link. Each link can be expanded to reveal details regarding the last check: status code and time of checking.

Note that the link checking information may be outdated at the time it is displayed in the VLO. The link checker can only check the status of each resource location with a certain frequency, and the availability status may change.
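The classification described above can be sketched as follows. This is an illustration only, not the actual VLO or Curation Module code: it covers just the four status codes given as examples, and the function and constant names are made up.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=100)   # older link checking information is discarded
UNAVAILABLE_CODES = {404, 503}  # examples from the text; the real list may be longer
RESTRICTED_CODES = {401, 403}   # examples from the text; the real list may be longer


def link_status(status_code, checked_at, now):
    """Classify a link check result, or return None if the check is too old."""
    if now - checked_at > MAX_AGE:
        return None
    if status_code in UNAVAILABLE_CODES:
        return "unavailable"   # shown with a warning sign
    if status_code in RESTRICTED_CODES:
        return "restricted"    # shown with a lock icon
    return "no marker"


now = datetime(2024, 6, 1)
print(link_status(404, datetime(2024, 5, 1), now))
print(link_status(403, datetime(2024, 5, 1), now))
```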

If you find that a resource found in the VLO is actually unavailable or has restricted access but would like to access it, contact its provider via the landing page (if available) or look for contact information in the 'All metadata' tab.

If you are a resource provider and believe that the accessibility status information shown for your resources is incorrect, please contact the VLO developers.

How often is the content of the metadata curation updated?

The CLARIN Curation Module analyses the metadata harvested for inclusion in the VLO. It creates reports for metadata collections on the basis of a schedule. Collection reports are generated on Tuesday and Saturday, beginning at 7 p.m. CET. At the end of the procedure, which takes about 2 hours, the curation module web site is reinitialised. This takes a few minutes to complete, after which the latest reports are available. The date and time of analysis are included in the collection report.


How is the order in which records are displayed in the VLO determined?
Why do some of my records not appear in the VLO?

Some or most of my records seem to have been harvested and imported into the VLO fine. However, some records seem to have been omitted. How come?

If some or most of your records have been harvested and imported into the VLO fine, but some records seem to have been omitted, they were probably explicitly skipped by the VLO importer process. Records with no (valid) resource proxies are omitted from the import process. Make sure to always include at least one resource proxy, for example a landing page or link to a resource (see "How does the Resources section work?"). The VLO import process also skips records that are too large. The limit currently lies at 50 megabytes. If you have larger metadata records that you think should be included, please contact us at

Note that metadata files that are not in the CMDI 1.2 version get converted before import. The limit applies to the converted file, so certain records may get skipped even if the original file is within the size limit. A converted record should not be more than twice the size of the original file, depending on the formatting; typically the file size difference is 20% or less.
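The two skip rules can be summarised in a short sketch. This is not the importer's actual code: the function name is hypothetical, and interpreting "50 megabytes" as 50,000,000 bytes is an assumption.

```python
# Assumption: "50 megabytes" is read as decimal megabytes
MAX_SIZE_BYTES = 50 * 1000 * 1000


def skip_reason(converted_size_bytes, valid_resource_proxy_count):
    """Return why a record would be skipped on import, or None if it would not be."""
    if valid_resource_proxy_count == 0:
        return "no valid resource proxies"
    # The limit applies to the size after conversion to CMDI 1.2
    if converted_size_bytes > MAX_SIZE_BYTES:
        return "too large"
    return None


print(skip_reason(10_000, 0))
print(skip_reason(60_000_000, 1))
print(skip_reason(10_000, 1))
```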

If files seem to be omitted from hierarchies, but do appear in the VLO in isolation, the most likely reason is that the Resource Proxy reference value (a URL or PID) used in the parent record does not match the self link value (in the MdSelfLink header item) in the referenced record. The VLO will only consider a record to be a parent of another record if it uses the exact self link of the latter to link to it.
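The exact-match rule can be illustrated with a small sketch. The data structure and all URLs below are hypothetical; the point is only that a reference which differs from the child's self link in any way (here, a trailing slash) does not create a parent-child relation.

```python
def build_hierarchy(records):
    """Map each record's self link to the child self links it references exactly.

    records: list of dicts with 'self_link' (MdSelfLink value) and
    'metadata_refs' (ResourceRef values of proxies with ResourceType Metadata).
    """
    self_links = {record["self_link"] for record in records}
    return {
        record["self_link"]: [ref for ref in record["metadata_refs"] if ref in self_links]
        for record in records
    }


# Hypothetical records; the second reference has a trailing slash and will not match
ROOT = "https://example.org/hypothetical/root.cmdi"
CHILD_1 = "https://example.org/hypothetical/child1.cmdi"
CHILD_2 = "https://example.org/hypothetical/child2.cmdi"
records = [
    {"self_link": ROOT, "metadata_refs": [CHILD_1, CHILD_2 + "/"]},
    {"self_link": CHILD_1, "metadata_refs": []},
    {"self_link": CHILD_2, "metadata_refs": []},
]

hierarchy = build_hierarchy(records)
print(hierarchy[ROOT])
```

In this example the second child would appear in isolation rather than under its intended parent.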

What is the update schedule for the metadata in the VLO?

The metadata harvester runs with different configurations at different starting times:

  • Monday and Thursday 20:00 CET/CEST: harvesting of CLARIN providers
  • Friday 20:00 CET/CEST: harvesting of non-CLARIN providers

The importer for the production VLO runs after every completed harvest (importing all current metadata, both CLARIN and non-CLARIN).

Combined, harvester/import runs typically take about 12-20 hours to complete, depending primarily on the response rate of the metadata providers.
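To estimate when the next harvest will start, the schedule above can be encoded as in the following sketch. It is only an illustration: the function name is made up, time zones are ignored (the input is assumed to already be in the harvester's local CET/CEST time), and the schedule itself may change.

```python
from datetime import datetime, timedelta

# weekday(): Monday = 0 ... Sunday = 6
HARVEST_WEEKDAYS = {"clarin": {0, 3}, "non-clarin": {4}}
HARVEST_HOUR = 20


def next_harvest_start(now, kind):
    """Return the first scheduled harvest start at or after 'now' (naive local time)."""
    candidate = now.replace(hour=HARVEST_HOUR, minute=0, second=0, microsecond=0)
    for _ in range(8):  # at most one week ahead, plus today
        if candidate.weekday() in HARVEST_WEEKDAYS[kind] and candidate >= now:
            return candidate
        candidate += timedelta(days=1)
    raise ValueError("no harvest day configured for " + kind)


# 2 January 2024 was a Tuesday
tuesday_morning = datetime(2024, 1, 2, 9, 0)
print(next_harvest_start(tuesday_morning, "clarin"))
print(next_harvest_start(tuesday_morning, "non-clarin"))
```

Adding the 12 to 20 hours that a combined harvest and import run typically takes gives a rough estimate of when new metadata becomes visible.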

Because of this schedule, it can take a couple of days before provided metadata becomes available in the VLO. If it takes longer than that, please send a message to

Be aware that the harvester and VLO schedule may be subject to change in the future!

Can I test-drive the VLO import for my CMDI metadata?
Can I harvest myself? Where is the source code of the harvester?
I have a self-developed repository. How can I offer my CMDI files over OAI-PMH?

There are several packages available to set up an OAI provider. Some popular examples:

  • file-based, Tomcat web application written in Java: jOAI (some centres have good experiences with this one)

  • database-based, written in PHP: oai-pmh-2

  • library to connect to a Java application: OAIProvider

More tools to set up an OAI data provider can be found at the OAI webpage.

Which sources are harvested for the VLO?
I have multiple OAI prefixes. How does the harvester deal with this?

The harvester uses the namespace URL to detect the provided metadata format, so all prefixes get included automatically. If you nevertheless encounter unexpected behaviour, please contact

How does the mapping to the VLO facets work?

The VLO uses concepts from the profiles (and in some cases XPaths as a fallback). A detailed overview is available at:

More information about this topic can be found in this paper

If I am hosting a hierarchical CMDI metadata collection, do I need to offer all records over OAI-PMH or only the root nodes?
Short answer: as indicated in the protocol, you need to offer all records. There is no automatic harvesting of any of the child nodes.
Long answer: in the case of our toy example hierarchy (
You would need to provide the following files over OAI-PMH:
Providing collection_root.cmdi (or even collection_olac.cmdi and collection_lrt_inventory.cmdi) is not enough, as all OAI harvesters are protocol-agnostic and thus do not know about CMDI’s hierarchy building! CMDI-consuming applications, such as the VLO, also need the physical files locally.
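To work out which records a hierarchy contains, and therefore which records need to be offered over OAI-PMH, the parent-to-child links can be followed recursively. The sketch below is an illustration under assumptions: the three collection file names come from the text above, but the structure of the hierarchy and the leaf record names are hypothetical.

```python
def records_to_provide(root, children):
    """Return every record reachable from 'root', in depth-first order.

    children: dict mapping a record to the records it links to with
    ResourceType Metadata.
    """
    ordered, seen, stack = [], set(), [root]
    while stack:
        record = stack.pop()
        if record in seen:
            continue  # guard against records that are linked more than once
        seen.add(record)
        ordered.append(record)
        stack.extend(reversed(children.get(record, [])))
    return ordered


# Hypothetical hierarchy; the leaf record names are made up for illustration
children = {
    "collection_root.cmdi": ["collection_olac.cmdi", "collection_lrt_inventory.cmdi"],
    "collection_olac.cmdi": ["hypothetical_olac_record.cmdi"],
    "collection_lrt_inventory.cmdi": ["hypothetical_lrt_record.cmdi"],
}

all_records = records_to_provide("collection_root.cmdi", children)
print(len(all_records), "records need to be offered over OAI-PMH")
```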

What is this metadata harvesting thing? Are you growing potatoes?
How can I publish my metadata to the Virtual Language Observatory ?

If you have many metadata records or records that frequently change:

  • Use OAI-PMH
  • Provide them preferably as CMDI (click here for details about how to serve CMDI over OAI-PMH)
  • If that is not possible provide them as OLAC
  • Depending on the situation your endpoint has to be added to the centre registry (in case of a registered CLARIN centre), or added manually to our harvester

If you have only a few static records and setting up an OAI-PMH access point is not feasible:

  • Submit your records to the Language Resource Inventory (for data, corpora, lexica, web services, software, ...)
  • All the records in the inventory will be automatically converted into CMDI records. Note that this process can take a while.

Data provided over OAI-PMH or via the LRT inventory will be made available and searchable via the Virtual Language Observatory.

If I deliver CLARIN metadata today, can you make my metadata available to the broad public?

Yes. Bringing all metadata descriptions together ("harvesting") and making them searchable ("indexing") is an important part of the infrastructure that CLARIN is building. When you provide metadata, CLARIN can harvest it and make it available via the Virtual Language Observatory.

The component registry cannot find my concept. What now?

The component registry only searches in the CCR's metadata concept scheme. You can either request your national CCR coordinator to have the concept be assigned to this concept scheme or manually copy & paste the concept's handle in the ConceptLink field in the component registry.

How can I add concepts myself?
Are there concepts that CLARIN uses and recommends in the concept registry?

Yes, go to the site mentioned above and select the concepts with an Approved status in the Status facet.

Where are the concepts stored?

In a concept registry - a server that can be reached via the internet, both by human users and computer programs.

If anybody can create metadata components, how can you still search through the resulting metadata descriptions?

There are indeed issues with searching if people aren't using matching descriptions. Think of someone calling a collection of texts a "text collection", while someone else might be searching for a "(text) corpus". A person can also be labelled a "speaker", "participant", "actor", "author", and so forth. Or think of all the variants that people can use for one and the same country: "the Netherlands", "Nederland", "Netherlands", "Holland", etc. The same goes for linguistic annotations: "noun" and "substantive" can both be used to describe the same part-of-speech tag. To counter these problems the metadata components contain links to a kind of database that contains atomic concepts (say "country" or "resource type"). Smart software will later on be able to "see" that if a user searches for nouns, they might also be interested in substantives, because the terms either refer to the same concept, or the concepts are marked as related.
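The idea of searching via shared concepts can be sketched in a few lines. This is a toy illustration, not how any CLARIN search software is implemented: the concept identifiers are invented placeholders, and the mapping is hard-coded rather than read from a concept registry.

```python
# Hypothetical mapping from labels to made-up concept identifiers
CONCEPT_OF_LABEL = {
    "noun": "concept:hypothetical-noun",
    "substantive": "concept:hypothetical-noun",
    "the Netherlands": "concept:hypothetical-netherlands",
    "Nederland": "concept:hypothetical-netherlands",
    "Netherlands": "concept:hypothetical-netherlands",
    "Holland": "concept:hypothetical-netherlands",
}


def expand_query(term):
    """Return all labels linked to the same concept as 'term' (or just the term)."""
    concept = CONCEPT_OF_LABEL.get(term)
    if concept is None:
        return [term]
    return sorted(label for label, linked in CONCEPT_OF_LABEL.items() if linked == concept)


print(expand_query("noun"))
```

A search for "noun" is thus widened to records that use "substantive", because both labels point to the same concept.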