This CLARIN Resource family page lists corpora and lexica for Sign Languages (SL). A substantial number of these can already be found in the CLARIN infrastructure, but they deserve a spotlight of their own.
Sign language (SL) corpus resources contain transcriptions/annotations of spontaneous or elicited dialogues and narratives. All resources are in a video format because of the gestural/spatial-visual modality, a vital characteristic of signed languages (sign languages used by Deaf-blind signers can also be received in the tactile modality). SL corpora are crucial resources for various types of linguistic research, such as lexicography, phonology, syntax, and pragmatics, as well as for language typology.
This page also provides access to SL lexical resources. Some of them are connected to SL corpora. There are also independent lexical resources that were primarily created for language learning and teaching.
This page was constructed by harvesting metadata for SL corpora in three different ways:
- By making an inventory of the material (datasets and resources) offered by CLARIN K-Centres with expertise in SL.
- By making an inventory of other datasets in the VLO which may qualify as members of the new CRF by contacting the rights holders.
- By making an inventory of any other material (e.g., new datasets, annotation tools, manuals) not yet accessible through the CLARIN Infrastructure by sending out questionnaires to SL communities.
For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
Sign language resources in the CLARIN infrastructure
Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Adamorobe Sign Language Corpus Size: 90 MPG1 and 90 MPEG2 clips |
Adamorobe Sign Language | The Adamorobe Sign Language Corpus contains almost 36 hours of video recordings of Adamorobe Sign Language, filmed in Adamorobe in Ghana between 2000 and 2004 by Victoria Nyst. The deposit contains recordings of approximately 20 signers. The 39 original tapes were digitized, cut, compressed and converted into MPG1 and MPEG2 digital clips using the standard settings of the MPI in Nijmegen. The total number of clips is 90 MPG1 and 90 MPEG2. There are 27 complete synchronized ELAN transcriptions in English and in Twi, which is the Akwapim variety of Akan, the spoken language in Adamorobe. The recordings include spontaneous narratives, personal stories and stories about the history of Adamorobe, elicited data, retellings of cartoons and picture stories. | Download |
|
Balinese |
The collection includes sign language data from deaf homesigners in Bali, Indonesia. The data was collected between 2021 and 2023. The collection is available for download from the Language Archive. |
Download |
Size: 200 hours |
Catalan Sign Language (LSC), German Sign Language (DGS), Italian Sign Language (LIS), Dutch Sign Language (NGT), Spanish Sign Language (LSE), and Turkish Sign Language (TİD). |
This is a collection of datasets connected to the Sign-Hub project. The corpus contains interviews conducted with elderly Deaf signers from five countries on their life experiences as well as a documentary movie based on these interviews. These interviews were conducted in five of the participating countries of the SIGN-HUB project and in six different sign languages: Catalan Sign Language (LSC), German Sign Language (DGS), Italian Sign Language (LIS), Sign Language of the Netherlands (NGT), Spanish Sign Language (LSE), and Turkish Sign Language (TİD). In each country, interviews have been conducted in different geographical areas. The exact number of interviews differs per sign language, but for every sign language, at least 20 interviews have been conducted, with interviewees being between 66 and 97 years of age. Interviews followed a pre-defined questionnaire; however, the addition of country-specific questions was encouraged. This collection is available for download from the Ortolang repository. For the relevant publication, see Pfau et al. (2021) |
Download |
Size: 341 clips in MPG1 and MPG2 format |
Dogon Sign Language |
The Dogon Sign Language Corpus contains 32 hours of video data, recorded in the Dogon area in Mali between 2010 and 2012. These recordings are cut into 341 shorter clips of varying lengths, in MPG1 and MPG2 format. The recordings feature the signing of 41 men and 27 women. The average age of all signers was 30 years. Recordings were made in 13 locations. Following approaches developed in earlier sign language corpora, the following types of data were collected for the Dogon Sign Language Corpus:
Metadata are stored in the sign language format, using the ARBIL editor software. The entire corpus, i.e. the video clips, annotations and metadata, is stored in the DoBeS archive at the Max Planck Institute for Psycholinguistics in Nijmegen.
|
Download |
Hotel Review Corpus – Dutch Sign Language Size: 21,825 words; 3.5 hours |
Dutch Sign Language (NGT) |
This is a multimodal parallel corpus of hotel reviews that were originally written in Dutch and subsequently translated into Dutch Sign Language by 6 professionals, all of whom are deaf translators. The corpus is available for download from the Dutch Language Institute. |
Download |
Size: 2375 sessions |
Dutch Sign Language (NGT) |
This corpus contains sessions with linked media files and ELAN annotation files (EAF); about 15% of the sessions are glossed and translated. For the relevant publication, see Crasborn et al. (2008) |
Download |
Size: 23 hours |
Dutch Sign Language (NGT) | This corpus contains 15 spontaneous dialogues and multi-participant conversations by deaf signers, 10 of which were recorded in authentic settings like a deaf club and a bar, and 5 of which were recorded in the lab. In addition, two informal three-party conversations were filmed in which each participant was wearing a mobile eye tracker. | Browse |
Size: 32 recordings |
Dutch Sign Language (NGT) | The Visibase corpus is a collection of digitised and described NGT material that was present in the late 1990s at the sign language research groups at the University of Amsterdam and at Leiden University. The project ran from 1996 to 2001 and was based at Radboud University, University of Amsterdam and Utrecht University. | Download |
Size: 76 MPEG1 recordings |
Dutch Sign Language (NGT), British Sign Language (BSL), Swedish Sign Language (SSL), German Sign Language (DGS) |
The corpus contains recorded sign narrations of five fable stories, a small lexicon, and interviews with the signers for each of the three languages. In addition, there is sign language poetry from BSL, NGT and SSL. Finally, the corpus includes two annotated segments of the Gehörlos So! corpus of German Sign Language (DGS) by Jens Heßmann. For the relevant publication, see Crasborn et al. (2007) |
Download |
Corpus-PhD-Fusellier-Souza-2004 Size: 10 discourses with 3 Deaf emerging signers in Brazil |
Emerging Sign Languages (in Brazil) |
This is a corpus containing 10 discourses with 3 Deaf emerging signers in Brazil. The corpus is available for download from the Huma-num repository (COCOON). |
Download |
AddictionLink in Finnish Sign Language Licence: Under negotiation |
Finnish Sign Language (FinSL) | This corpus contains written and recorded (audio and video) materials pertaining to alcohol, drugs and addictions, as well as to independent change programs and a self-assessment test on the use of alcohol. | |
Consumer Information in Finnish Sign Language Licence: Under negotiation |
Finnish Sign Language (FinSL) |
This corpus contains written and recorded (video) materials pertaining to advice aimed at consumers with regard to e.g. product defects, service-related complaints, canceling orders and online shopping. The corpus is available for download from a dedicated webpage. |
Browse |
Finnish Sign Language Learning Material Licence: Under negotiation |
Finnish Sign Language (FinSL) |
This corpus contains written and recorded (audio and video) materials pertaining to Finnish sign language greetings, names of family members, numbers and telling the time, as well as basic verbs and related words. The corpus is available for download from a dedicated webpage. |
Browse |
Licence: Under negotiation |
Finnish Sign Language (FinSL) |
This corpus contains recordings of Finnish news. The corpus is available for download from a dedicated webpage. |
Browse |
Size: 163 minutes |
Finnish Sign Language (FinSL) |
This is a video corpus of the language policy program for the National Sign Languages in Finland, translated by two people who have the sign language as their mother tongue. The corpus is available for download from the Finnish Language Bank. |
Download |
Translations of the Bible and of the Church Manual into Finnish Sign Language Licence: Under negotiation |
Finnish Sign Language (FinSL) |
This is a video corpus of Bible translations (including The Gospels of John and Luke and the Old Testament, Genesis 1:1-4:16, 6:1-9:17), mass and other religious ceremonies, as well as other religious documents. The corpus is available for online browsing through a dedicated webpage. |
Browse |
Belgian Covid Sign Language Corpus (BeCoS Corpus) Size: 177 hours of speech |
Flemish Sign Language (VGT), French Belgian Sign Language (LSFB), French, Dutch |
This corpus consists of the entire archive of official press conferences from the Belgian Federal Government concerning the COVID-19 pandemic. The speakers speak mostly Dutch or French and occasionally German, and nearly all speech is accompanied by a deaf signer who interprets live what is said. The corpus is available for download from the Dutch Language Institute. |
Download |
Size: 140 hrs / 5 TB |
Flemish Sign Language (VGT) |
This is a collection of videos in Flemish Sign Language. 120 deaf people contributed to the Corpus VGT as informants. Age, region and gender were taken into account when selecting the informants. The informants were given a series of themes to talk about in pairs: telling a story, making agreements, discussing a theme, telling about their school days, etc. The conversations were recorded on video and edited for each assignment. The corpus is available for download from the Dutch Language Institute and for browsing through a dedicated website. |
|
Hotel Review Corpus – Flemish Sign Language Size: 21,825 words; 4 hours |
Flemish Sign Language (VGT), Dutch |
This is a multimodal parallel corpus of hotel reviews that were originally written in Dutch and subsequently translated into Flemish Sign Language by 6 professionals, all of whom are deaf translators. The corpus is available for download from the Dutch Language Institute. |
Download |
Size: 50 hours of recording in total. |
French Sign Language (LSF) |
This is a corpus of children's LSF collected from 65 deaf children and 17 deaf adults (control group), conducted by four deaf interviewers from four different regions of France (4 stimuli, 2 cameras). 50 hours of recording in total. A sample of the corpus (10 extracts of Tom & Jerry cartoon narrative, filmed with two cameras (20 files)) is available for download from the Huma-Num repository. For the relevant publication, see Balvet et al. (2010) |
Download |
Size: 7 extracts filmed with 3 cameras (21 files). |
French Sign Language (LSF) |
This is a corpus of dialogues between Deaf adults (106 hours of video data): 51 interviews, conducted by four Deaf interviewers from four different regions of France (semi-directive interviews, 3 cameras). A sample of the corpus (7 extracts filmed with 3 cameras (21 files)) is available for download from the Huma-num repository. For the relevant publication, see Garcia et al. (2013) |
Download |
Size: 2 hours |
French Sign Language (LSF) |
This is a reference corpus for LSF, recorded in January 2002 in Paris, involving 13 Deaf adults (monologues). The corpus is divided into 5 video files of varying length. It contains a description in French (some metadata) and a translation in French of narratives and other discourses following the time code. The topics and genres included are: "Le Récit du Cheval" (narrative), "Le Récit des Oiseaux" (narrative), "L'Euro" (argumentative discourse), "La Recette de Cuisine" (cooking recipe), "Le 11 septembre 2001" (argumentative and narrative discourse) and "Le Thème Linguistique" (metalinguistic discourse). The corpus is available for download from the Huma-num repository. For the relevant publication, see Cuxac et al. (2002) |
Download |
Size: 368 subtitled videos |
French Sign Language (LSF) |
This is a 2D-skeleton video corpus of LSF with French subtitles. The corpus consists of 368 subtitled videos produced by Média’Pi!, a media company producing bilingual content with LSF and written French. The corpus was produced at the Laboratoire d’informatique pour la mécanique et les sciences de l’ingénieur (LIMSI). From the original videos, 25 body keypoints, 2x21 hand keypoints and 70 face keypoints were extracted using OpenPose. 135 keypoints for every person in every frame of the 368 videos were provided, as well as the associated subtitles in French. The corpus is available for download from Ortolang. |
Download |
Annotation: partially annotated corpus |
French Sign Language (LSF) |
This is a corpus of French Sign Language (LSF) captured with a motion capture system and an HD camera. It was designed with the objective of carrying out multidisciplinary studies in Movement Sciences, Linguistics and Computer Science. The corpus consists of 5 tasks of different natures: description, explanation, narration and translation, performed by 4 speakers (8 for the description task). The corpus is available for download from the Ortolang repository. For the relevant publication, see Benchiheub et al. (2016) |
Download |
Size: 11 poems in LSF and 57 translations in French (several versions for each poem) |
French Sign Language (LSF), French |
This corpus contains eleven poetic works in LSF (French sign language) and their fifty-seven translations into oral French. The corpus is available for download from the Ortolang repository. For the relevant publication, see Catteau (2020) |
Download |
Size: approx. 10 samples |
French Sign Language (LSF), French |
This is a corpus of spontaneous exchanges between either hearing or deaf children on the one hand and either hearing or deaf parents on the other. A sample of the corpus is available for download from the Ortolang repository. For the relevant publication, see Leroy et al. (2009) |
Download |
Size: 6 sessions x 4 video files |
French Sign Language (LSF), Spoken French |
The theme of the dialogues is the description of routes and places in Marseille and Aix-en-Provence in France. The corpus is composed of 3 dialogues in LSF and 3 dialogues in French. Each dyad is composed of a moderator and a speaker. There is a single moderator for French and two moderators for LSF. The recording equipment consisted of 3 cameras and 2 headset microphones for the French spoken part. The corpus is composed of 6 sessions: 1, 2 and 3 for French and 4, 5, 6 for LSF. Each dyad is composed of a speaker located on the right of the overview noted A, and a moderator located on the left of the overview noted B. Thus, for session 1, the speakers are conversing in French, A1 is the speaker located on the right of the overview and B1 is the moderator located on the left of the overview. For each session there are 4 video files (mp4/AVC): 1 for the speaker, 1 for the moderator, 1 which gives a profile view of the two speakers, the overview, and 1 which is a montage of these 3 videos. All the files are synchronised. For the LSF part, there is no sound track in the videos. For the French part, there are 2 sound files (wave) in addition to the video files, 1 per speaker. The first 3 videos do not contain a sound track. Only the editing video contains sound, that of the speaker on the right in the right channel and that of the moderator on the left in the left channel. For the relevant publication, see Braffort and Boutora (2012) |
Download |
Size: Text feature: 10 files, Video feature: 25 hours, Data format: MPEG-4 |
French, Modern Greek, German, English, Greek Sign Language, British Sign Language (BSL), German Sign Language (DGS), French Sign Language (LSF) |
This is a multimedia (video) corpus for four sign languages (British, French, German and Greek) with at least 14 informants per language and a session duration of approx. 2 hours, using the same elicitation materials (scripts and tasks) across languages. For the relevant publication, see Efthimiou et al. (2010) |
Browse |
Size: +50 hours |
German Sign Language (DGS) |
The DGS Corpus is a collection of German Sign Language (DGS) data from 330 signers from Germany. The 15-year long-term project is based at the Institute of German Sign Language and Communication of the Deaf at the Universität Hamburg and started in 2009. It is led by Thomas Hanke and Annika Herrmann. The DGS Corpus is used to build the DGS-German dictionary DW-DGS. For the relevant publication, see the list of publications. |
Download |
Licence: Restricted, see here |
Italian Sign Language (LIS) |
The Italian Sign Language Corpus is a collection of Italian Sign Language (LIS) data from 180 signers from Italy. The core part of the project involved three universities: University of Milan-Bicocca, University Ca’Foscari and Sapienza University. The corpus is available for download from MPI's Language Archive (CLARIAH-NL). |
Download |
Kata Kolok Child Signing Corpus Size: Data from four focal deaf children accumulates to 95h 24min (Lutzenberger 2022:282). |
Kata Kolok (Benkala Sign Language) |
This corpus covers spontaneous child-caregiver interactions focused on five deaf and eight hearing children acquiring Kata Kolok natively. Ages range from 4 months to 8;4 years. The corpus is not freely accessible due to the vulnerable target group. Contact person: Hannah Lutzenberger (h.lutzenberger [at] bham.ac.uk). For the relevant publication, see Lutzenberger (2022) |
Browse |
Size: 63.5; data collection ongoing |
Kata Kolok (Benkala Sign Language) |
This corpus includes a wide range of elicited and spontaneous language materials amounting to 100 hours of video data from generations III-V of adult deaf and hearing signers. Ongoing data collection (as of 2022) is focused on generation III, as they are currently among the eldest KK signers. For the relevant publication, see de Vos (2016) |
Browse |
Size: 27 minutes |
Marajó Sign Language (Brazil) |
This is a corpus of sign language practiced in Soure, on the island of Marajó (Brazil, Pará). These data were collected between July and August 2015 and in March 2017. This corpus is available for download from the Ortolang repository. The videos made available for download represent part of the total corpus of 8 hours and 27 minutes. They consist of elicited stories (9 minutes and 27 seconds) and spontaneous speech (17 minutes and 13 seconds). |
Download |
Corpus Maurician Sign Language by Univ Paris 8 & INJS Size: 19 discourses (narratives and other genres) |
Mauritian Sign Language (LSM) |
This is a corpus of 19 discourses (narratives and other genres). The corpus is available for download from the Huma-num repository (COCOON). |
Download |
Size: 3,600 sentences |
Modern Greek, Greek Sign Language |
This is a parallel corpus for the language pair Greek Sign Language (GSL) – Greek. The corpus incorporates sentences performed by a single signer in three repetitions each, captured in front view by means of one HD and one Kinect camera. The corpus was annotated in the iLex annotation environment and provides information for the grammar levels of lexicon, morphology, syntax and semantics, incorporating annotation tiers for gloss, classifier type, shape and semantics, clause type, sentence type and equivalent translation in Greek on sentence level. The corpus consists of 3500 ELAN (.eaf) files. The corpus is available for download from CLARIN:EL, though access requires registration. For the relevant publication, see Efthimiou et al. (2018) |
Download |
"Exhibition Corpus" - Text, Sound, Sign Size: 23 texts |
Norwegian Bokmål, Norwegian Nynorsk, Norwegian, Norwegian Sign Language (NSL) | This corpus contains texts produced during a 2013 exhibition about languages - "Leve Språket". The exhibition aimed at showing the linguistic diversity in Norway, and it covered topics such as language conflict, the understanding of neighbouring languages and linguistic humor. The target audience was teenagers in school, and the texts are formulated accordingly. The texts were translated into Norwegian Sign Language and either Norwegian Bokmål or Nynorsk. The texts were also recorded to serve as an audio guide in the exhibition room. | Download |
Norwegian Sign Language Corpus Size: 8 video clips, 18 minutes |
Norwegian Sign Language (NSL) |
This corpus consists of data collected in 2007 for the purposes of a doctoral research project about boundary markers in Norwegian Sign Language. Four signers were filmed: two men and two women, both young and old. They are all deaf with deaf parents, siblings, or other family members. They live in central Eastern Norway, and all have gone to the deaf school in the area. The signers were asked to retell a children’s picture book entitled “Frog, Where Are You?” by Mercer Mayer and also to respond to the question “What happened on 9/11 and what did you do?” Video recordings of the signers were made in a studio, and sessions were led by a deaf adult man who is an L1 signer of Norwegian Sign Language. No other people were present during the recordings. The corpus is available for download from the CLARINO repository. |
Download |
Hotel Review Corpus – Spanish Sign Language Size: 20,609 words; 3 hours of videos |
Spanish Sign Language (LSE), Spanish |
This is a multimodal parallel corpus of hotel reviews that were originally written in Dutch, subsequently translated into Spanish and finally into Spanish Sign Language by 6 professionals, all of whom are deaf translators. The corpus is available for download from the Dutch Language Institute. |
Download |
Corpus LS Tunisienne (Fadwa Mhimdi) Size: 10 narrative discourses |
Tunisian Sign Language (TSL) |
This is the first scientific corpus of narrative discourses in Tunisian Sign Language (LST) by Deaf adults. The data were filmed in the Tunis region. The corpus is available for download from the Ortolang repository. |
Download |
Turkish sign language database Licence: Restricted, see here |
Turkish Sign Language (TİD) |
This corpus collects Turkish Sign Language (TİD) data. For this project, native, early, and late TİD signers were recorded performing different tasks (narratives of short picture stories/cartoon clips) and engaging in free conversation. These recordings and their annotations are stored in this corpus. The corpus is available for download from the MPI (CLARIAH-NL distribution). |
Download |
Annotation: EAF transcripts |
Turkish Sign Language (TİD) and German Sign Language (DGS) | This is a corpus of DGS and TİD data collected by the Max Planck Institute for Psycholinguistics under the lead of Asli Özyürek from March 2007 to September 2012. | Download |
Lexical resources
Corpus | Language | Description | Availability |
---|---|---|---|
Adamorobe Sign Language Lexicon Size: 250 signs |
Adamorobe Sign Language |
This lexicon contains 250 signs in isolation. For a subset of the signs, encodings about phonological and iconic features are available. The lexicon is available for download from the MPI Language Archive. |
Download |
Licence: Public |
British Sign Language (BSL) |
This lexicon was derived from the British Sign Language Corpus and is part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
Annotation: unannotated |
Chinese Sign Language (CSL) |
This lexicon demonstrates how a Deaf adult signs a story to Deaf children. The lexicon is available for download from the MPI Language Archive. |
Download |
Czech Sign Language Corpus for Recognition – Amateur Signer Licence: ELRA |
Czech Sign Language | This is an amateur sign-language database comprising 25 signs from Czech Sign Language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second one is similar, but with the camera placed about one meter higher than the first one so as to produce a frontal top-view, thus allowing 3D information to be detected. The last view is a frontal-detail view of the speaker's face, thus allowing lip-reading. | |
Czech Sign Language Corpus for Recognition – Professional Signer Size: 378 signs |
Czech Sign Language |
This lexicon comprises signs performed by 4 everyday sign-language users (4 women, 2 of them deaf). 5 repetitions of each sign were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second one is similar, but with the camera placed about one meter higher than the first one so as to produce a frontal top-view, thus allowing 3D information to be detected. The last view is a frontal-detail view of the speaker's face, thus allowing lip-reading. For the relevant publication, see ELRA (European Language Resources Association) |
|
Size: 300 signs |
Dutch Sign Language (NGT) |
This lexicon forms part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
ECHO NGT lexicon, Male signer 2 Size: 300 signs |
Dutch Sign Language (NGT) |
This lexicon forms part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
ECHO NGT lexicon, female signer 2 Annotation: unannotated |
Dutch Sign Language (NGT) |
This lexicon forms part of the ECHO case study on sign languages. The signer retells the fable The Shepherd Boy and the Wolf. The source of the retelling is a Dutch version of the fables by author Paul Biegel, consisting of approximately 300 words. The lexicon is available for download from the MPI Language Archive. |
Download |
Woordenboek Vlaamse Gebarentaal Size: 7.5 hours |
Flemish Sign Language (VGT) |
This resource contains the video material of the Dictionary of Flemish Sign Language. Each of the 10,025 videos shows a single sign. The dictionary is available for download from the Dutch Language Institute. |
Download |
Size: 1000 entries per language (video and text) |
French, Modern Greek (1453-), German, English, Modern Greek Sign Language, British Sign Language, German Sign Language, French Sign Language |
This is a multilingual lexicon in which concepts are linked to graphically represented signs and accompanying videos showcasing the signing process. The videos are annotated with HamNoSys ("Hamburg Sign Language Notation System"). The lexicon is available for online browsing via a dedicated interface. For the relevant publication, see Efthimiou, S-E. Fotinea, et al. (2010) |
Browse |
Size: 8,616 lemmas |
Greek Sign Language (GSL) |
This is an online dictionary of lemmas taken from three previously developed resources, namely (i) the NOEMA DB, from which it incorporates 3,000 revised entries, (ii) the GSL segment of the Dicta Sign Corpus, from which it incorporates 2,000 entries, and (iii) the POLYTROPON Parallel Corpus, from which it incorporates 3,616 new entries. The lexicon is available for online browsing through an interface provided by the CLARIN:EL consortium. For the relevant publication, see E. Efthimiou, S-E. Fotinea, et al. (2019) |
Browse |
Size: 3,000 video entries |
Greek Sign Language (GSL) |
This dictionary contains video recorded signs paired with Modern Greek translations. The dictionary incorporates explanatory remarks that help non-native GSL users understand the meaning of the sign, while at the same time allowing for native GSL signers to enrich their Modern Greek vocabulary. The dictionary allows users to search by lemma, which means either by (i) hand shape, (ii) lemma classification according to syntactic category, or (iii) by the alphabetic ordering of the sign translations in Modern Greek. The dictionary is not available online. |
|
Size: 300 signs |
Swedish Sign Language (SSL/STS) |
This lexicon forms part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
Other sign language resources
Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: BSL video data from 249 deaf signers of BSL |
British Sign Language (BSL) |
The British Sign Language Corpus is a collection of British Sign Language (BSL) video clips of 249 deaf signers from the UK. The BSL Corpus project was based at the Deafness Cognition and Language Research Centre, University College London, ran from 2008 to 2011 and was led by Adam Schembri. A related dataset is the BSL Signbank. For the relevant publication, see Schembri, A., et al. (2012) |
Download |
Corpus of the Danish Sign Language Dictionary Size: 4.5 hours |
Danish Sign Language (DTS) |
This corpus consists of video material from 31 signers of DTS from Denmark. The corpus is used to build a DTS-Danish dictionary. The Danish Sign Language Dictionary project building the corpus is based at the Bachelor’s Degree Programme in Danish Sign Language and Speech-to-text Interpreter at the University College Copenhagen and led by Mads Jonathan Pedersen and Thomas Troelsgård. The project started in 2014 and is still ongoing. For the relevant publication, see Kristoffersen and Troelsgaard (poster) |
|
Size: Around 500 hours |
Dutch Sign Language (NGT) | This corpus contains three sets of data. The first is a set of longitudinal data of deaf children from deaf and hearing parents that has been collected at the UvA since the late 1980s. The second is a new collection of longitudinal data collected at the RU from hearing and deaf children of deaf parents (2008–2020). The third is a set of data collected in an educational context by Nini Hoiting at the Kentalis Guyot school. | Browse |
Corpus of Finnish Sign Language Size: 14 hours 22 minutes |
Finnish Sign Language (FinSL), Finnish |
The corpus consists of video-recorded conversations and elicited narratives from 21 Finnish Sign Language signers who belong to different age groups and live in different parts of Finland. The signers perform seven fixed tasks. All of the video data (14.5 hours by six camera angles) has been annotated for signs and translations. According to the tasks performed by the signers, the corpus has been divided into two subcorpora: one that contains the elicited narratives, and another that contains the conversations.
The corpus is available for download from the Meta-Share (FIN-CLARIN Distribution). For the relevant publication, see Salonen et al. (2020) |
Download |
Licence: CC BY-NC-SA 4.0 |
Flemish Sign Language (VGT), Swiss-German Sign Language (DSGS) | This is a collection of six datasets recorded and created by the Content4All research project. The datasets are hosted by the University of Surrey and are password protected. To request download credentials, please contact Richard Bowden (r.bowden [at] surrey.ac.uk). | Download |
Corpus LSFB (University of Namur) Size: 10 hours |
French Belgian Sign Language (LSFB) |
This is the first large-scale digital corpus that illustrates the current use of French Belgian Sign Language (LSFB) and all its variations. It was first conceived for linguistic research. However, this digital library is an unprecedented tool for teachers, students and interpreters, as well as a safeguard of the linguistic and cultural heritage of the Deaf Community. |
|
Hungarian Sign Language Corpus Size: 30 hours (Grammatical Corpus) |
Hungarian Sign Language |
The Hungarian Sign Language Corpus is a collection of Hungarian Sign Language (HSL) video data of 147 signers from Hungary. Overall, 1,750 hours were recorded. The HSL corpus project ran from 2016 to 2017, was based at the Research Institute for Linguistics at the Hungarian Academy of Sciences and was led by Csilla Bartha. For the relevant publication, see Bartha et al. (2016) |
|
|
Irish Sign Language (ISL) |
The Signs of Ireland Corpus is a collection of Irish Sign Language (ISL) video data from 40 signers in Ireland. The project was based at Trinity College Dublin, took place in 2004 and was led by Lorraine Leeson. For the relevant publication, see Leeson (2011) |
|
|
Polish Sign Language (PJM) | This is a corpus of video data from 150 Deaf native signers of Polish Sign Language (PJM). | |
Annotation: tokenised, lemmatised, gestural annotation, mouth shape, ID-gloss |
Slovene Sign Language (SZJ) | This corpus is available for querying in its transcribed version providing an avatar demonstration of each sign. The corpus contains interviews with 80 informants. The entire corpus is currently not publishable due to data protection issues; however, permissions for publication are being collected in order to release the recordings too. | |
Size: 4 hours 52 minutes / 48 recordings |
Spanish Sign Language (LSE) |
This corpus is intended for the analysis of LSE argument structure, focusing on how signers organize nouns (and noun-like forms) and verbs (and other forms that have a predicative function) to communicate who does what, feels what, talks about what, etc. The aim was not to create a representative and structured corpus, but rather a set of examples that would allow the grammatical description to be based on contextualized uses. Only a part is accessible through the iSignos website. One part of the corpus (2 hours and 21 minutes) is annotated with right-hand and left-hand id-glosses and glosses for classifiers, translation into Spanish, role-shift, PoS, argument structure, locus and animacy. The other part is annotated only with glosses, translation into Spanish and role-shift; some recordings (16) also include an analysis of the non-manual component. For the relevant publication, see Pérez et al. (2019) |
Concordancer |
Annotation: annotated for right-hand and left-hand id-glosses and glosses for classifiers and Spanish translations |
Spanish Sign Language (LSE) |
This corpus consists of a set of video recordings of signers who express themselves in LSE, presented together with the glosses of both hands and the Spanish translation. In the first stage, a set of videos with their corresponding glosses and translations is available, which will be expanded in successive phases. You can consult the list of recordings and select by genre or theme, and also by the sex or age range of the signers. The resource can be useful for anyone who needs this type of linguistic data for their work, for example, for class exercises, interpreting practice, language evaluations, research on LSE, etc. The corpus is available through a dedicated search engine that allows you to explore the corpus and observe the context in which the searched glosses appear. |
Browse |
Size: 24 hours |
Swedish Sign Language (STS), Swedish |
This is a web-based version of the Swedish Sign Language Corpus, consisting of approximately 93,000 annotated sign tokens. Previously, the corpus was only available through the special-purpose video annotation tool ELAN. The aim of this corpus is to provide a picture of what sign language sentences look like, but also to contribute new signs and variants to the Swedish Sign Language Dictionary. It can also be used to develop teaching materials. For the relevant publication, see Öqvist et al. (2020) |
Concordancer |
Tactile Swedish Sign Language Corpus Size: 4.5 hours |
Swedish Sign Language (STS), Swedish | This corpus contains dialogues and elicited narratives with 9 deafblind informants. The entire corpus is currently not publishable due to data protection issues; however, some parts are available through the STS-korpus. The project was funded by Mo Gård Research Fund. | Concordancer |
Giving Recognition a Hand Corpus Size: 84 videos |
Turkish Sign Language (TİD), Dutch Sign Language (NGT) |
This is a multilingual corpus of Turkish Sign Language (TİD) and Dutch Sign Language (NGT) as well as Turkish and Dutch data. It contains 84 video files of signers and speakers from Istanbul and Nijmegen. The project was based at the Max Planck Institute for Psycholinguistics, Centre for Language Studies. The corpus is available for download from a dedicated webpage. |
Download |
Lexical resources
Lexicon | Language | Description | Availability |
---|---|---|---|
Adamorobe Sign Language Lexicon Size: 250 signs |
Adamorobe Sign Language |
This lexicon contains 250 signs in isolation. For a subset of the signs, encodings about phonological and iconic features are available. The lexicon is available for download from the MPI Language Archive. |
Download |
Licence: Public |
British Sign Language (BSL) |
This lexicon was derived from the British Sign Language Corpus and is part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
Annotation: unannotated |
Chinese Sign Language (CSL) |
This lexicon demonstrates how a Deaf adult signs a story to Deaf children. The lexicon is available for download from the MPI Language Archive. |
Download |
Czech Sign Language Corpus for Recognition – Amateur Signer Licence: ELRA |
Czech Sign Language | This is an amateur sign-language database comprising 25 signs from Czech Sign Language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second is similar, but with the camera placed about one meter higher than the first so as to produce a frontal top view, which allows 3D information to be detected. The last is a detailed frontal view of the signer's face, which allows lip-reading. | |
Czech Sign Language Corpus for Recognition – Professional Signer Size: 378 signs |
Czech Sign Language |
This lexicon comprises signs performed by 4 everyday sign-language users (4 women, 2 of them deaf). 5 repetitions of each sign were recorded from 3 different views. The first is a frontal view of the upper part of the body. The second is similar, but with the camera placed about one meter higher than the first so as to produce a frontal top view, which allows 3D information to be detected. The last is a detailed frontal view of the signer's face, which allows lip-reading. For the relevant publication, see ELRA (European Language Resources Association) |
|
Size: 300 signs |
Dutch Sign Language (NGT) |
This lexicon forms part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
ECHO NGT lexicon, Male signer 2 Size: 300 signs |
Dutch Sign Language (NGT) |
This lexicon forms part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
ECHO NGT lexicon, female signer 2 Annotation: unannotated |
Dutch Sign Language (NGT) |
This lexicon forms part of the ECHO case study on sign languages. The signer retells the fable The Shepherd Boy and the Wolf. The source of the retelling is a Dutch version of the fables by author Paul Biegel, consisting of approximately 300 words. The lexicon is available for download from the MPI Language Archive. |
Download |
Size: 1000 entries per language (video and text) |
French, Modern Greek (1453-), German, English, Modern Greek Sign Language, British Sign Language, German Sign Language, French Sign Language |
This is a multilingual lexicon in which concepts are linked to graphically represented signs and accompanying videos showcasing the signing process. The videos are annotated with HamNoSys ("Hamburg Sign Language Notation System"). The lexicon is available for online browsing via a dedicated interface. For the relevant publication, see E. Efthimiou, S-E. Fotinea, et al. (2010) |
Browse |
Size: 8,616 lemmas |
Greek Sign Language (GSL) |
This is an online dictionary of lemmas taken from three previously developed resources, namely (i) the NOEMA DB, from which it incorporates 3,000 revised entries, (ii) the GSL segment of the Dicta Sign Corpus, from which it incorporates 2,000 entries, and (iii) the POLYTROPON Parallel Corpus, from which it incorporates 3,616 new entries. The lexicon is available for online browsing through an interface provided by the CLARIN:EL consortium. For the relevant publication, see E. Efthimiou, S-E. Fotinea, et al. (2019) |
Browse |
Size: 3,000 video entries |
Greek Sign Language (GSL) |
This dictionary contains video-recorded signs paired with Modern Greek translations. The dictionary incorporates explanatory remarks that help non-native GSL users understand the meaning of the sign, while at the same time allowing native GSL signers to enrich their Modern Greek vocabulary. Users can search for lemmas by (i) hand shape, (ii) syntactic category, or (iii) the alphabetical order of the sign translations in Modern Greek. The dictionary is not available online. |
|
Size: 300 signs |
Swedish Sign Language (SSL/STS) |
This lexicon forms part of the ECHO case study on sign languages. The lexicon is available for download from the MPI Language Archive. |
Download |
Colophon
The page for this resource family was created by a working group of representatives of various CLARIN Knowledge Centres with expertise in SL resources:
K-Centre ACE: https://ace.ruhosting.nl
K-Centre :EL slt.ilsp.gr / https://www.clarin.gr/en/kcentre
K-Centre CLARIN-SMS: https://sweclarin.se/eng/centers/stockholm
Contact person: henk.vandenheuvel [at] ru.nl (Henk van den Heuvel), K-Centre ACE