Organizations-Projects | Standards | Applications | Equipment | Terminology | Other Notes | References

Organizations-Projects (also initiatives, groups, etc.) [top]
AGD - Archiv für Gesprochenes Deutsch -
     by IDS (Institut für Deutsche Sprache) -
     projects: FOLK, DGD 2.0, FOLKER, GAT 2, REDE
AILLA "is a digital archive of recordings and texts in and about the indigenous languages of Latin America. 
     ...project was founded in 2000 ... The pilot web site and database, launched in April, 2000" 
ANC American National Corpus -
     The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.
     When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC).
     The ANC has so far released 22 million words of American English, available from the Linguistic Data Consortium
     OANC (Open ANC) includes over 14 million words from the Second Release that can be freely distributed.
     XCES format is used. =>GATE (General Architecture for Text Engineering)
ANKOS Anadolu Üniversite Kütüphaneleri Konsorsiyumu (Anatolian University Libraries Consortium)
     Founded in 2000 with 12 member institutions subscribing to three databases.
     Has an Open Access and Institutional Archives Group (Açık Erişim ve Kurumsal Arşivler Grubu).
APEnet (Archives Portal Europe) -
     is a Best Practice Network project supported by the European Commission in the eContentplus programme 
     and its objective is to build an Internet Gateway for Documents and Archives in Europe where 14 European National Archives in close cooperation with the EUROPEANA initiative 
     will create a common access point to European archival descriptions and digital collections. 
     "Participation in Archives Portal Europe will be open to all European archival repositories that can contribute descriptions of their holdings that are structured in accordance with the 
     appropriate international archival standards (either in EAD, EAC, EAG and METS format 
     or in a format that can be converted into EAD, EAC, EAG and METS preferably by themselves with the tools delivered by the project)."
ASEDA Aboriginal Studies Electronic Data Archive
     Corpora with about 300 Australian Indigenous languages including dictionaries, grammars and teaching materials.
BNC British National Corpus
     1991- ; BNC recordings have been deposited at the British Library Sound Archive.
     The corpus is encoded according to the Guidelines of TEI to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). 
     Metadata is a TEI-conformant header.
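     A TEI-conformant header is ordinary structured XML, so its metadata can be read with standard tools. A minimal sketch in Python (the header content below is invented, and real BNC headers are far richer):

```python
import xml.etree.ElementTree as ET

# Toy TEI-style header (invented content); real TEI headers carry far more detail.
tei_header = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample spoken text</title></titleStmt>
    <sourceDesc><p>Recorded conversation, 1992</p></sourceDesc>
  </fileDesc>
</teiHeader>"""

root = ET.fromstring(tei_header)
title = root.findtext("./fileDesc/titleStmt/title")
print(title)  # prints: Sample spoken text
```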
BROWN Brown Corpus (shortly), Brown University Standard Corpus of Present-Day American English
     "...pioneered the field of corpus linguistics in 1967 with 1 mil. words, 
     by now typical corpora (such as the COCA, the BNC or the ICE) tend to be much larger, in the order of 100 million words." -wiki
CECL Centre for English Corpus Linguistics
     Louvain-la-Neuve, Belgium
     project: ICLE - International Corpus of Learner English
CHILDES Child Language Data Exchange System 
     CHILDES is the child language component of the TalkBank system. 
     TalkBank is a system for sharing and studying conversational interactions. 
     CHAT Transcription Format
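     CHAT files mark speaker turns on "main tier" lines such as `*MOT: ...`. A hedged sketch of reading one such line (the three-letter speaker code and tab separator follow common CHAT practice; this toy reader ignores dependent tiers, codes, and everything else in the real specification):

```python
import re

# Minimal reader for one CHAT-style main tier line, e.g. "*MOT:\thello there ."
# Illustrative only; the real CHAT format defines many more tiers and codes.
CHAT_MAIN_TIER = re.compile(r"^\*(?P<speaker>[A-Z]{3}):\t?(?P<utterance>.*)$")

def parse_main_tier(line):
    """Return (speaker, utterance) for a CHAT main tier line, or None otherwise."""
    m = CHAT_MAIN_TIER.match(line.rstrip("\n"))
    if m is None:
        return None  # dependent tiers (%mor, %gra, ...) and headers fall through here
    return m.group("speaker"), m.group("utterance").strip()

print(parse_main_tier("*MOT:\thello there ."))  # prints: ('MOT', 'hello there .')
```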
CIDOC - ICOM's (International Council of Museums) International Committee for Documentation ...was formed in 1950 as part of the first meeting of ICOM.
     maintains the CRM (Conceptual Reference Model) standard, mainly for cultural heritage
CLARA Common Language Resources and their Applications -
     " an ITN (Initial Training Network) in the Marie Curie Actions financed by the EC under FP7."
CLARIN (Common Language Resources and Technology Infrastructure)
     The ultimate objective of CLARIN is to create a European federation of existing digital repositories that include language-based data, to provide uniform access to the data, wherever it is, and to provide existing language and speech technology tools as web services to retrieve, manipulate, enhance, explore and exploit the data. 
     The primary target audience is researchers in the humanities and social sciences and the aim is to cover all languages relevant for the user community. The objective of the current CLARIN Preparatory Phase Project (2008-2010) is to lay the technical, linguistic and  organisational foundations, to provide and validate specifications for all aspects of the infrastructure (including standards, usage, IPR) and to secure sustainable support from the funding bodies in the (now 23) participating countries for the subsequent construction and exploitation phases beyond 2010. 
     a European research infrastructure (or network)
     specs (metadata, web services, etc.):
     =>D-Spin, WebLicht
CNRI Corporation for National Research Initiatives -
     "...a not-for-profit organization formed in 1986 to foster research and development for the National Information Infrastructure."
     "CNRI also houses a program of activities to support research in digital libraries and networked information technologies that includes 
     D-Lib Magazine." (1995-)
     => standards > Handle System (CNRI Persistent ID)
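     Handles (and DOIs, which are built on the Handle System) resolve through public proxy servers by simple URL concatenation; a sketch (the identifier below is illustrative only):

```python
# Handle/DOI resolution via the public proxy servers.
HANDLE_PROXY = "https://hdl.handle.net/"
DOI_PROXY = "https://doi.org/"

def resolution_url(identifier, proxy=HANDLE_PROXY):
    """Build the URL at which a Handle or DOI can be dereferenced."""
    return proxy + identifier

# Illustrative identifier, not a guaranteed-live record:
print(resolution_url("10.1045/example", DOI_PROXY))
```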
COCA Corpus of Contemporary American English
     1990- ,
daofind (2005-2006) and daofind+ (2007) Digitalisiertes Archivgut in Online-Findbüchern / Digitized Archives in Online Finding Aids - completed December 2007. -
     A project of the Bundesarchiv (German Federal Archives) with support from the Andrew W. Mellon Foundation, New York
     ...the project's aim was to pilot the combined use of the three international standards EAD, EAC and METS...
     developed MEX (MidosaEditor for XML-Standards) software and some other tools.
DARIAH (EU, ESFRI project) -
DBpedia - the "Linked Data version of Wikipedia": a community effort to extract structured data from Wikipedia and make this information available on the Web. - The DBpedia knowledge base currently describes more than 2.9 million things
     DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. 
     We use the SPARQL query language to query this data.
     The "DBpedia ontology" consists of 205 classes which form a subsumption hierarchy and have altogether 1,200 properties.
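     A SPARQL query against DBpedia travels as an ordinary HTTP request; the sketch below only builds the request URL and sends nothing (the endpoint address and the dbo:abstract property are assumptions based on DBpedia's usual setup):

```python
from urllib.parse import urlencode

# Assumed public endpoint; not contacted here.
ENDPOINT = "http://dbpedia.org/sparql"

# Example query: English abstract of the Berlin resource (property name assumed).
query = """PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}"""

def build_request_url(endpoint, sparql):
    # "query" and "format" parameter names follow common SPARQL-protocol usage.
    return endpoint + "?" + urlencode({"query": sparql,
                                       "format": "application/sparql-results+json"})

url = build_request_url(ENDPOINT, query)
print(url.startswith("http://dbpedia.org/sparql?query="))  # prints: True
```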
DCPSE Diachronic Corpus of Present-Day Spoken English 
     by UCL, selected spoken parts of LLC and ICE-GB
DC/DCMI Dublin Core Metadata Initiative -
DELAMAN Digital Endangered Languages and Musics Archives Network
     free membership, 23 participants (10.2009) including DoBes, ELAR, etc.
DiVA - Academic Archive On-line, is a publishing system for research and student theses and a digital archive for long-term preservation of publications. 
     DiVA began its development in 2000 at the EPC (Enheten för digital publicering, the unit for digital publishing) at Uppsala University Library. 
     Now (2009) 24 uni./inst. -
     see also: TEI, DocBook
DNB Deutsche Nationalbibliothek (German National Library)
DoBeS Dokumentation Bedrohter Sprachen / Documentation of Endangered Languages
     In 2000 the Volkswagen Foundation started the DOBES programme in order to document languages that are potentially in danger of becoming extinct within a few years' time.
DocBook => standards.DocBook, see also: TEI, DiVA
DPC Digital Preservation Coalition
     acts as an agenda-setting "umbrella" body within the DP world... high-level goals...
     established in 2001 to secure the preservation of digital resources in the UK and to work with others internationally
     -DP Training Programme (DPTP), courses -, 
     -Leadership Programme, Roadshows (1-day meetings)
     -publications repository at ULCC (Uni. of London Computer Centre):
D-SPIN (" the German contribution to the European =>CLARIN-Projekt")
     "...The aim of D-SPIN along with CLARIN is to establish a virtual research infrastructure based on available language resources and tools."
     WebLicht (Web-based LRT services) "is a Service Oriented Architecture (SOA) for building annotated text corpora. 
     WebLicht has been under development since October 2008 within the framework of the D-SPIN project."
     Participants: BBAW – Berlin, ASV – Uni. Leipzig, IDS – Mannheim, IMS – Uni. Stuttgart, Department of Linguistics – Uni. Tuebingen
     At the moment [2010] contains ca. 25 tools and resources, all for automatic annotation of text corpora. 
     These are tools like the following: Tokenizer, Part-of-Speech Tagger, Constituent-Parser, Semantic Annotators, etc.
     ...a standardised data format for creating text corpora has been defined. 
     It is widely compatible with existing formats (Negra, Paula, TüBa/DZ) and is translatable anytime via converters. 
EAGLES Expert Advisory Group on Language Engineering Standards -
     Initiative to support the development of standards for language resources, the manipulation of language resources, and the evaluation of resources, tools and products.
     EAGLES-I (1993-1995) 
     EAGLES-II (1997-1998)
     EAGLES WP4 - Survey of existing practices in the representation and annotation of dialogue HTML version (16th October 1998)
     recommendation: =>CES (Corpus Encoding Standard)
     This project then evolved to =>ISLE ("Now prosecuting with the ISLE project...") 
ECHO European Cultural Heritage Online
    ! North Carolina ECHO (Exploring Cultural Heritage Online) 
      project: - portal: - metadata:
      it is, as the motto "Connecting to Collections" tells, a portal to online special collections of North Carolina's libraries, archives, museums, historic sites, and other cultural institutions.
       2002-05 they tried to construct their own system as a fork of the LC standards (EAC, EAD, etc.)...
      these simple "adaptations" are made by their "Metadata Initiatives" (NCDC, PMDO, NCEAD, NCEAC)
ELAR The Endangered Languages Archive 
     is part of the Hans Rausing Endangered Languages Project (HRELP), based at the School of Oriental and African Studies (SOAS), University of London. 
     ...accepting deposits since 28th October 2005 (60 deposits 11.2009). "We intend to serve our catalogue in OLAC, IMDI and OAI compliant metadata."
     Elar resources:
     Elar catalogue:
ELDA / Elda S.A. Evaluations and Language resources Distribution Agency
     -operational/distributor body of ELRA, France 
ELRA European Language Resources Association
     Catalogue of Language Resources: and
     ELDA UC (Universal catalogue)
E-MELD Electronic Metastructure for Endangered Languages Data - (see ILIT for actual, 2009 state)
     2001-07, with 5 project participants: Wayne State University, Eastern Michigan University, The University of Arizona, The Linguistic Data Consortium, The Endangered Language Fund
     E-MELD Grant Proposal:
     metadata server at ...BUT:
     "... In order to reach the widest possible audience, E-MELD changed its project goal of mounting a central metadata server 
     into implementing an OLAC service provider." 
     GOLD linguistic ontology being developed as part of the E-MELD project at the University of Arizona
     ORE (OLAC Repository Editor) and some other mostly online tools...
ENABLER (Network) European National Activities for Basic Language Resources
     The initial Network Members are: Università degli Studi di Pisa (UPI - Italy), Institute for Language and Speech Processing (ILSP - Greece), European Language Resources-Distribution Agency (ELDA S.A. - France), Center for Sprogteknologi (CST - Denmark), Nederlandse Taalunie (NTU - Netherlands), Institut d’Estudis Catalans (IEC - Spain), Ludwig Maximilians Universität München (LMU - Germany), Fundação da Universidade de Lisboa – Centro de Linguística da Universidade de Lisboa (FUL-CLUL - Portugal), Faculty of Mathematics and Physics, Charles University in Prague (MFF UK - Czech Republic), Beltext, Université de Liège (ULg - Belgium), Department of Speech, Music and Hearing (KTH - Sweden), Real Academia Española (RAE - Spain), Instituut voor Nederlandse Lexicologie (INL - Netherlands)
ePrints UK Project - - "Open [Scholarly] Archive"...
     see also UKOLN - University of Bath (ukoln-website publish most of the documentation.)
     "Current practice for repositories is to expose simple DC records over OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) [12] as mandated by that protocol. 
     "However, it is widely agreed that simple DC has limitations that pose problems for repository developers and aggregator services. 
     Issues relating to normalised names, use of controlled subject vocabularies or other authority lists, dates and identifiers are common and many were identified in the course of our functional requirements gathering."
     Eprints Application Profile (presentation, documentation)
        "In May 2006, the Joint Information Systems Committee (JISC) [1] approached UKOLN [2] and the Eduserv Foundation [3] to collaborate on the development of a metadata specification for describing eprints (alternatively referred to as scholarly works, research papers or scholarly research texts)."
        In order to build up a "DC application profile" for scholarly publications...
        FRBR is used as the basis of the model, some of the entity and relationship labels used in FRBR have been modified for this model, in order to make them more intuitive to those dealing with eprints and to align them with the terminology used in DC
        (for instance, ...) a simple vocabulary for use with the dc:type property in the Scholarly Works Application Profile, consisting of the classes listed below: 
        ScholarlyText, Book, BookItem, BookReview, ConferenceItem, ConferencePaper, ConferencePoster, JournalItem, JournalArticle, NewsItem, Patent, Report, SubmittedJournalArticle, Thesis, WorkingPaper
     see EPrints software too.
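     The dc:type vocabulary listed above lends itself to a trivial validation step when producing records; a minimal sketch (the function name is ours, not part of the profile):

```python
# The Scholarly Works Application Profile dc:type vocabulary, as quoted above,
# used here as a simple validation set.
SWAP_TYPES = {
    "ScholarlyText", "Book", "BookItem", "BookReview", "ConferenceItem",
    "ConferencePaper", "ConferencePoster", "JournalItem", "JournalArticle",
    "NewsItem", "Patent", "Report", "SubmittedJournalArticle", "Thesis",
    "WorkingPaper",
}

def is_valid_type(value):
    """True if value is one of the 15 SWAP dc:type classes."""
    return value in SWAP_TYPES

print(is_valid_type("JournalArticle"), is_valid_type("Blog"))  # prints: True False
```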
Ethnologue -
     "An encyclopedic reference work cataloging all of the world’s 6,909 known living languages" is a collaboration between universities, research institutes and content providers. 
     The site is a prototype, and in the coming year we will be developing it in response to users' feedback and making it operational. 
HRELP Hans Rausing Endangered Languages Project - 
     3 programmes: the ELAR archive, The Endangered Languages Academic Programme (ELAP) running PhD and Masters courses in language documentation, 
     and The Endangered Languages Documentation Programme (ELDP) which is primarily a funding body.
     SOAS -The School of Oriental and African Studies - University of London
ICA International Council on Archives -
     standards: ISAD(G) and ISAAR(CPF) - International Standard Archival Authority Record (for Corporate Bodies, Persons, and Families)
          ISAD(G) General International Standard Archival Description
     "Membership of ICA gives you the opportunity of joining a network of over 1,169 archival institutions and 211 individual members from more than 190 countries."
ICADS - IFLA-CDNL (Conference of Directors of National Libraries) Alliance for Digital Strategies
     ICADS was established in August 2008 as a successor to ICABS (IFLA CDNL Alliance for Bibliographic Standards)
ICE International Corpus of English
     1990- East Africa, Great Britain, Hong Kong, India, Ireland, Jamaica, New Zealand, The Philippines, Singapore...
     ICE-GB is the British component of the International Corpus of English (ICE), by UCL 
     ICE-GB was first released in 1998 with ICECUP 3.0 (Corpus Utility Program for searching)
ICLE - International Corpus of Learner English (by CECL)
ICOMs International Council of Museums -
     on this context, see CIDOC
IDF International DOI Foundation 
     1998-, RAs = DOI Registration Agencies,
IASA International Association of Sound and Audiovisual Archives
     see Bradley, Kevin (ed.), "Guidelines on the Production and Preservation of Digital Audio Objects", 2nd ed. 2009 -
IDS (Institut für Deutsche Sprache) -
     see also AGD - Archiv für Gesprochenes Deutsch -
IFLA International Federation of Library Associations and Institutions
     FRBR (functional requirements), FRANAR => FRAD, FRSAR => FRSAD, ISBD (...), etc.
ILIT (and LinguistList)- Institute for Language Information and Technology -
     an autonomous research center at Eastern Michigan University
     LinguistList is an OLAC service provider = gateway = search facility
     1990: founded by Anthony Rodriques Aristar  at the University of Western Australia  with 60 subscribers; 1991: moved to Texas A&M University; 1991: Eastern Michigan University was established as the main editing site for the list.
     1997: moved from Texas A&M to its own site (where it has been ever since); 1998: Wayne State University in Michigan was established as the second editing site for the list.
     2006: projects consolidated and became the Institute for Language Information and Technology (ILIT), an autonomous research center at Eastern Michigan University
ILSP Institute for Language and Speech Processing, Greece
INTERA Integrated European language data Repository Area - -
     funded by the EU, started in January 2003 and lasted 2 years (last event at their page: November 2004, Access Management Workshop, Nijmegen.)
     later evolved to IMDI + OLAC + DC ... 
     "The INTERA project has essentially two pillars: (1) to build an integrated European language resource area (with help of metadata descriptions and the interlinking of the resulting distributed resource repository) by connecting international, national and regional data centers.
     (2) to produce new multilingual language resources. ("The second goal addresses the lack of quality of multilingual resources, especially for less widely spoken languages, including Balkan ones, which are of crucial importance to the development of the eContent business. INTERA goes further ahead by developing exemplary methods for their business attractive production.")
     The project will essentially: deliver an integrated searchable and browsable European domain of metadata descriptions of language resources housed by various data centers... etc.
     Workpackages and The Results: WP1 : Project Management
     WP2 : Integrated Resource Domain <= last document at their page...
     WP3 : Standardized Descriptions <=  "all IMDI categories were entered into the ISO TC37/SC4 Data Category Registry, the IMDI set is now available in 8 languages (D, F, E, NL, Se, I, Gr, Sp)"
     ...WP7 : Dissemination and exploitation
IPA International Phonetic Association -
     provides a notational standard for the phonetic representation of all languages, International Phonetic Alphabet (also IPA).
     The latest version of the IPA Alphabet was published in 2005.
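     IPA symbols live in Unicode, so they can be inspected directly with the standard library's character database; for example:

```python
import unicodedata

# A few common IPA symbols with their codepoints and official Unicode names.
for ch in "ŋʃə":
    print(ch, "U+%04X" % ord(ch), unicodedata.name(ch))
# prints:
# ŋ U+014B LATIN SMALL LETTER ENG
# ʃ U+0283 LATIN SMALL LETTER ESH
# ə U+0259 LATIN SMALL LETTER SCHWA
```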
ISLE International Standards in Language Engineering - - old/sub mirror:
     The project is a cross-Atlantic collaboration (led by MPI) and is a continuation of the EAGLES initiative; started 1.2000 (last activity: 13-Dec-2001)
     It aims at developing and promoting Human Language Technologies (HLT) standards and guidelines for language resources, tools and products.
     Released =>IMDI Schema 1.8
ISOcat DCR is the Data Category Registry for ISO TC 37
ITHAKA was founded in 2003 and received start-up funding from The Andrew W. Mellon Foundation, the William and Flora Hewlett Foundation, and the Stavros S. Niarchos Foundation.
     PORTICO - (part of ITHAKA)
     "A Digital Preservation and Electronic Archiving Service" preserves scholarly literature published in electronic form – more than 14,000 electronic journals and books – and ensures that these materials remain accessible to future scholars, researchers, and students.
JSTOR - was conceived by William G. Bowen, then-President of the Andrew W. Mellon Foundation, to help university and college libraries...
     In 1995, following a pilot launched under the direction of the University of Michigan, JSTOR was established...
     is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive.
     in 2009, merged with and became a service of ITHAKA.
     2009 stats: "JSTOR has digitized more than 27 million pages of printed literature. We digitize on average 400,000 new pages each month, growing the archives by over 4.5 million pages annually."
KEEP (Keeping Emulation Environments Portable) will (2.2009-) develop an Emulation Access Platform to enable accurate rendering of both static and dynamic digital objects: text, sound, and image files; multimedia documents, websites, databases, videogames etc. 
     The overall aim of the project is to facilitate universal access to our cultural heritage by developing flexible tools for accessing and storing a wide range of digital objects. 
     Project KEEP is co-financed by the European Union’s Seventh Framework Programme for research and technological development (FP7).
     related EU PROJECTS: The results of other European projects such as Prestospace or Caspar as well as the International Internet Preservation Consortium (IIPC) will be taken into account. More particularly, KEEP has a strong relation to the outcomes of European projects Planets and SHAMAN: currently, Planets is developing a permanent access framework which will take care of retaining access to digital objects by performing preservation planning, characterisation and follow-up actions such as migration and emulation.
kopal - co-operative Development of a Long-Term Digital Information Archive
     The German National Library, the Niedersaechsische Staats- und Universitaetsbibliothek Goettingen SUB (State and University Library), GWDG (the Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen), and IBM Deutschland GmbH are working in close cooperation on 
     a technological solution for long-term archiving and more importantly, the preservation of digital documents
     Object Format und Detail Specification: DIAS-Core (Digital Information Archiving System) -
     Implementation of DIAS-Core Interface: kopal tools ( " ), koLibRI - kopal Library for Retrieval and Ingest.
     standards: kopal's UOF (Universal Object Format), with DC, LMER, METS, OAIS, URN
Lacito -
     The LACITO Archive provides free access to documents of connected, spontaneous speech, mostly in "rare" or endangered languages, recorded in their cultural context and transcribed in consultation with native speakers. 
     Its goal is to contribute to the documentation and study of a precious human heritage: the world's languages. 
     At present, the archive contains some 200 documents in 44 languages, annotated by some twenty specialists
LAN Language Archives Newsletter
     began under the auspices of DOBES  at the MPI
LDC Linguistic Data Consortium (at Uni. of Pennsylvania)
     a repository and catalogue of data providers and corpus authors.
     projects, including ANC (American National Corpus), tools (AGTK etc.)
     "The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes." -
LEAF Linking and Exploring Authority Files
     is a three year EU project that started in March 2001.
     LEAF project (2001-04) for uploading distributed authorities (persons and corporate bodies) to a central system that automatically links those authorities concerning the same entity.
     "For the authority records the EAC schema/DTD will be used." 
LinguistList - - see ILIT
LLC London-Lund Corpus (aka 'Survey Corpus') by UCL
LOC Library of Congress 
     Resource Description Formats: MARC 21, EAD, etc. / Digital Library Standards: Mets, etc.
LT World (Language Technology World) is the most comprehensive WWW information service and knowledge source 
     on the wide range of technologies that deal with human language. 
     The service is provided by the German Language Technology Competence Center at DFKI.
META-NET A Network of Excellence forging the Multilingual Europe Technology Alliance -
     META Kick-Off - February, 22nd+23rd, 2010 - Berlin, Germany
     Founding Members: (DFKI) Germany, (BM) Spain, (CNR) Italy, (ILSP) Greece, (CUNI) Czech Republic, (CNRS) France, (UU) The Netherlands, (Aalto) Finland, (FBK) Italy, (DCU) Ireland, (RWTH) Germany, (JSI) Slovenia, (ELDA), France 
METU Spoken Turkish Corpus Project = ODT-STD = ODTÜ Sözlü Türkçe Derlemi (TÜBİTAK 108K283)
     Prof. Dr. Şükriye Ruhi - project coordinator
     Thomas Schmidt and Kai Wörner (of the University of Hamburg's Research Centre on Multilingualism, developers of the EXMARaLDA tools) are also part of the project team.
     ...its aim is to create a scholarly resource that makes present-day Turkish browsable and searchable on computer, by linguistically analysing a 1-million-word database of Turkish conversations, whether face-to-face or via various communication channels (e.g., telephone and mass media).
     In ODT-STD the conversations will be transcribed with the EXMARaLDA tools, the transcripts will be viewable in synchrony with the audio/video recordings, and various discourse features of spoken Turkish will be tagged. This will make it possible, for example, to study forms of address, requesting, thanking and similar language patterns within their contexts of use.
     ODT-STD is supported by TÜBİTAK; a first small-scale trial release was planned for early 2010, and a release with the transcribed conversations for October 2010. In the following years the corpus is to be enlarged.
     * Volunteering:
     * Donations (transferring recordings):
METU Turkish Corpus - -
     a collection of 2 million words of post-1990 written Turkish samples.
        Note: To cite this corpus/collection, please use the following paper: 
        Bilge Say, Deniz Zeyrek, Kemal Oflazer, Umut Özge. “Development of a Corpus and a Treebank for Present-day Written Turkish”, (Proceedings of the Eleventh International Conference of Turkish Linguistics, August, 2002) İmer, Kamile and Gürkan Doğan (eds), Current Research in Turkish Linguistics, pp. 183-192, Eastern Mediterranean University Press, 2004. 
     "The METU-Sabancı Treebank is a subcorpus of the METU Turkish Corpus (Atalay, Oflazer, and Say, 2003; Oflazer etal., 2003) that includes material taken from three daily newspapers, 87 journal issues and 201 books. The treebank has 5620 sentences and 53,798 tokens." 
     " a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences." -
        Note: To cite this corpus, please use the following papers: 
        Kemal  Oflazer, Bilge Say,  Dilek Zeynep Hakkani-Tür, Gökhan Tür,  Building a Turkish Treebank, Invited chapter in Building and Exploiting Syntactically-annotated Corpora, Anne Abeille Editor, Kluwer Academic Publishers, 2003.
        Nart B. Atalay, Kemal Oflazer, Bilge Say, The Annotation Process in the Turkish Treebank, in Proceedings of the EACL Workshop on Linguistically Interpreted Corpora - LINC, April 13-14, 2003, Budapest, Hungary.
METU "Audio Corpora ?" - METU 1.0 (Middle East Technical University Turkish Microphone Speech Corpus) : "Project: The corpus has been collected at the Middle East Technical University (METU) as part of a collaborative work between the Department of Electrical and Electronics Engineering of the Middle East Technical University in Turkey and the Center for Spoken Language Research (CSLR) of the University of Colorado at Boulder, USA. The collaboration has been supported by TUBITAK, the Scientific and Technical Research Council of Turkey, through a combined doctoral scholarship program. The corpus was used to port the Speech Recognition System of CSLR, SONIC, to Turkish."
     "Phonemes are represented in METUbet form.  For more detailed information on the phonetic alphabet used,... Salor 2002"
MPI Max Planck Institute [for Psycholinguistics]
     BC (Browsable Corpus), ECHO (European Cultural Heritage Online), Intera, ISLE, DoBeS, LAT tools, etc.
Nestor - is the network of expertise for the long-term storage and accessibility of digital resources in Germany.
     see also KOPAL -
NIST National Institute of Standards and Technology (US)
OAI Open Archives Initiative - 
     "end user" <--- publish --- "service-providers" <--- harvested --- "data-provider" <--- preserved and metadata --- "resources"
     Participating archives or linguists, known as "data-providers," describe their resources using the OLAC metadata set, which is based on the 15-element Dublin Core. Search facilities like The LINGUIST List, known as "service-providers," "harvest" metadata records through periodic http requests. 
     see also sub-community OLAC, E-MELD project, LinguistList Gateway...
     standards: OAI-PMH and OAI-ORE
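     The harvesting described above comes down to HTTP GET requests with a `verb` parameter and XML responses; a sketch (the base URL is a placeholder and the record below is an invented, truncated oai_dc payload, parsed offline):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# A service provider would issue a request like this against a data provider:
BASE_URL = "http://example.org/oai"  # placeholder base URL
request_url = BASE_URL + "?" + urlencode({"verb": "ListRecords",
                                          "metadataPrefix": "oai_dc"})

# Invented, truncated oai_dc record as it might appear inside the response.
DC = "{http://purl.org/dc/elements/1.1/}"
sample_record = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A field recording</dc:title>
  <dc:language>x-sil-EXAMPLE</dc:language>
</oai_dc:dc>"""

root = ET.fromstring(sample_record)
title = root.findtext(DC + "title")
print(request_url)
print(title)  # prints: A field recording
```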
OAIS Open Archival Information System - a reference model (developed by the CCSDS, led by NASA):
     an OAIS is an archive, consisting of an organization of people and systems,
     that has accepted the responsibility to preserve information and make it available for a Designated Community.
OASIS Organization for the Advancement of Structured Information Standards
     historically, SGML Open consortium. Major commercial sponsors: SUN, IBM
     Implementations: OpenOffice 2.0; Star Office, IBM-Workplace-Shell, TYPO3, Google Docs, 
     DocBook => DocBook
     DITA (Darwin Information Typing Architecture) - "Serna XML Editor Features DITA XML format Support"
     UBL Universal Business Language 
     (from Turkey: Asuman Dogac, Chair 
      - SRDC Software Research & Development and Consultancy (SRDC Yazılım Araştırma & Geliştirme ve Danışmanlık) -
        at the same company: Dr. Gökçe Banu Laleci Ertürkmen
OCLC Online Computer Library Center
OLAC Open Language Archives Community -
     is a sub-community of the Open Archives Initiative, for metadata set ...
     OLAC-OAI (versus/with DC): "There is no need for OLAC data providers to store the records in both formats, however, 
     since the information in the oai_dc format is a subset of the information in the olac format."
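     Because the oai_dc information is a subset of the olac information, an oai_dc view can be derived by projecting a record onto the 15 simple DC elements; a sketch with an invented record (the `linguistic-type` key stands in for OLAC's extra, language-specific metadata):

```python
# The 15 simple Dublin Core elements (the real DC 1.1 element set).
SIMPLE_DC = {
    "title", "creator", "subject", "description", "publisher", "contributor",
    "date", "type", "format", "identifier", "source", "language",
    "relation", "coverage", "rights",
}

# Invented OLAC-style record: simple DC fields plus an OLAC-specific refinement.
olac_record = {
    "title": "Wordlist recordings",
    "language": "x-sil-EXAMPLE",
    "linguistic-type": "lexicon",  # assumed OLAC extension key, for illustration
}

# Deriving the oai_dc view is a projection onto the simple DC element set.
oai_dc_record = {k: v for k, v in olac_record.items() if k in SIMPLE_DC}
print(sorted(oai_dc_record))  # prints: ['language', 'title']
```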
PADI (The National Library of Australia's) Preserving Access to Digital Information
     PADI is managed by the National Library of Australia with the support of DPC (Digital Preservation Coalition) & nestor
     technical notes (metadata based on DC, Oracle/Java servlet, etc.):
PICS Platform for Internet Content Selection 
     a cross-industry working group whose goal is to facilitate the development of technologies to give users of interactive media, such as the Internet, control over the kinds of material to which they and their children have access.
PKP Public Knowledge Project
     many open source tools/applications like Open Journal Systems, Open Conference Systems, Open Archives Harvester
PORTICO began as the Electronic-Archiving Initiative launched by JSTOR in 2002 with a grant from The Andrew W. Mellon Foundation;
     was launched in 2005 with additional support from JSTOR, Ithaka, The Library of Congress, and The Andrew W. Mellon Foundation.
     12.2009: 14,444,764 preserved articles, 87 participating publishers, 39,329 committed e-journal and e-book titles
Romani Project – a cluster of academic research activities based at the School of Languages, Linguistics and Cultures at the University of Manchester.
     Romani Linguistics and Romani Language Projects. nice layout, presentation...
Rosetta Project -
     " a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of human languages. 
     Since becoming a National Science Digital Library collection in 2004, the Rosetta Archive has more than doubled its collection size, 
     now serving nearly 100,000 pages of material documenting over 2,500 languages—the largest resource of its kind on the Net."
SAA Society of American Archivists -
     mainly a community; supports EAD but does no research or standards work
     Founded in 1936, SAA is North America's oldest and largest national archival professional association 
     ...more than 5,000 individual and institutional members
ScanDiaSyn (Scandinavian Dialect Syntax) - Nordic Dialect Corpus -
SIL [initially] Summer Institute of Linguistics -
     Toolbox etc. software, fonts, etc. ...
TEI Text Encoding Initiative (TEI) -
     see also: DocBook, DiVA
TIGER project was funded by the Deutsche Forschungsgemeinschaft (DFG) from 1999 to 2004.
     The aim of the project was the creation of a large syntactically annotated corpus of German newspaper text.
TKD Türk Kütüphaneciler Derneği (Turkish Librarians' Association) - ICA member
     Necatibey Cad, Elgün Sok, 8/8, KIZILAY 06440
     Telephone: 90 3 1223 01325 Fax: 90 3 1223 204 53
UCL University College London 
     (UCL secured 174 awards from 651 applications in the 2008-09 financial year, compared with the 162 secured by Cambridge and the 153 won by Oxford.)
     Survey of English Usage: London-Lund Corpus (LLC), ICE-GB (part of ICE), DCPSE, etc.; also participated in ICLE, etc.
UHLCS University of Helsinki Language Corpus Server. 
     a multilingual corpus server containing corpora of more than 50 languages, including samples of minority languages and extensive corpora representing different text types.
UKOLN - University of Bath -
     see Cidoc-CRM standard and ePrints for details, but this ukoln-website is rich for the documentation.
VIVALDI - VIVaio Acustico delle Lingue e dei Dialetti d'Italia (1998-)
     Humboldt-Universität Berlin, Institut für Romanistik
     "An exception is the Atlas of Dolomitic Ladin and Neighbouring Dialects (ALD), published in 1998; its delivery includes (besides 4 map volumes and 3 index volumes) 3 CD-ROMs on which the complete language material of the 21 central Ladin localities recorded for the atlas is stored."
     "possibility of comparing the current data with the data of the 
     AIS (Sprach- und Sachatlas Italiens und der Südschweiz) and the ALI (Atlante linguistico italiano)" - 
VRA Visual Resources Association -
     NA-North American...
     VRA Core 4.0 is a [meta]data standard for the cultural heritage community
XÖV - XML in der öffentlichen Verwaltung (XML in public administration)
     Deutschland-Online etc. -
WebLicht is a Service Oriented Architecture (SOA) for building annotated text corpora. 
     WebLicht has been under development since October 2008 within the framework of the 
     => D-SPIN project. -
     => TCF TextCorpus format

S t a n d a r d s (metadata etc.) [top]
* Metadata - General resources and indices - IFLA -
* discussion: International Standards for Digital Archives - - Berlin 24.-27.04.2007
CES Corpus Encoding Standard, from EAGLES project, latest 2000! 
     Ide, N., Priest-Dorman, G. and Véronis, J. (1996) EAGLES recommendations on corpus encoding . EAGLES Document EAG-TCWG-CES/R-F. Version 1.4, October, 1996. 
     .corpus / "CES selected the relevant elements from TEI for corpus encoding." / => XCES: XML Version of the CES
CHAT Codes for the Human Analysis of Transcripts 
     transcription format used for the CHILDES project.
CIDOC CRM (Conceptual Reference Model) 
     was developed by the ICOM/CIDOC Documentation Standards Group. Since September 2000, the CIDOC CRM is being developed into an ISO standard in a joint effort of the CIDOC CRM SIG and ISO/TC46/SC4/WG9. 
     the EU project CHIOS (2001-2003) produced the CRM final technical documentation
     Version 4.2.1: October  2006. Since September 2006 it has been accepted as ISO 21127. Version 5.0.1: November 2009
     It represents an 'ontology' for cultural heritage information i.e. it describes in a formal language the explicit and implicit concepts and relations relevant to the documentation of cultural heritage.
     90 classes/entities (E1 CRM Entity, "This class comprises all things in the universe of discourse  of the CIDOC CRM..."), 
     148 properties
     ...close relation with IFLA, their FRBR etc... and
CLARIN =>see project - specs (metadata, web services, etc.):
CNRI Persistent ID => Handle System
corpus standards => TigerXML, XCES, CES (EAGLES Corpus Encoding Standard)
     also => glossary > corpus
DBpedia ontology --> DBpedia
     "...DBpedia  ontology consists of 205 classes which form a subsumption hierarchy and have altogether 1200 properties."
DC / DCMS / DCMES / DCMT / [DCTERMS] Dublin Core Metadata "Element Set/Terms" (of 15 elements for use in resource description)
     ISO Standard 15836:2009
     The name "Dublin" is due to its origin at a 1995 invitational workshop in Dublin, Ohio; 
     "core" because its elements are broad and generic, usable for describing a wide range of resources." -official description, (2008-01-14)
     1995: "This memo provides information for the Internet community... 
            Introduction - Finding relevant information on the World Wide Web has become increasingly problematic due to the explosive growth of networked resources." 
            -S. Weibel 1998. Dublin Core Metadata for Resource Discovery -
     1996-98: "The current list of elements and their general definitions were finalized in December 1996." 
     -Dublin Core Metadata Element Set, Version 1.0 -
     Using DC:
     ...In July of 2000 DCMI issued its list of recommended DC Qualifiers
          DC Qualifiers (has a summary table).
     ...replaced by  DCMI Elements and Element Refinements - a current list
     ...replaced by DCMI Metadata Terms -2003
     ...replaced by LATEST DCMI Metadata Terms -2008-01-14
     dc: ==> legacy, Metadata Element Set, Version 1.1 [DCMES] 
          "/elements/1.1/" namespace (e.g.,
     dcterm: ==> more comprehensive, DCMI Metadata Terms [DCTERMS] 
          "/terms/" namespace (e.g.,
     "Implementers may freely choose to use these fifteen properties either in their legacy dc: variant or in the dcterms: variant depending on application requirements. 
     Over time, however, implementers are encouraged to use the semantically more precise dcterms: properties."
     "Levels of interoperability" 
     / closed system, Level 1 solutions DCMI Metadata Terms 
     / exchangeability of metadata with "the rest of the world" is important, consider Level 2; 
     interoperability among metadata-using applications is based on 
     the shared formal model provided by RDF, which is used to support "Linked Data" (, also see DBpedia). 
     / level 3-4 (
     => Application Profile, see ePrint
     => Description Set Profiles: A constraint language for Dublin Core Application Profiles [DCMI Working Draft]. This document specifies an application-independent language for the constraints used in defining "templates" for metadata records -- for example, to specify a template for records which describe exactly "one book", with "up to ten authors", using "subject headings from the Library of Congress".
     "Repositories are not required to store their metadata in DC; DC is something that is frequently "converted to", rather than "stored in". Many repositories store their metadata in some other format, and dynamically convert to DC in response to harvester requests." -
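To make the dc:/dcterms: distinction concrete, here is a minimal sketch (not official DCMI tooling; the element values are made up) that serializes an oai_dc record using the legacy /elements/1.1/ namespace with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Namespace URIs: the oai_dc container and the legacy dc: element set.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

def make_oai_dc_record(title, creator, date):
    """Build a minimal oai_dc record with three of the fifteen DC elements."""
    root = ET.Element("{%s}dc" % OAI_DC)
    for tag, value in (("title", title), ("creator", creator), ("date", date)):
        ET.SubElement(root, "{%s}%s" % (DC, tag)).text = value
    return ET.tostring(root, encoding="unicode")

record = make_oai_dc_record("A Sample Resource", "Doe, Jane", "2009-11-01")
```

Swapping the DC constant for the /terms/ namespace (http://purl.org/dc/terms/) would yield the semantically more precise dcterms: variant of the same three properties.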
DIAS Digital Information Archiving System -
     Object Format und Detail Specification, see kopal project 
DocBook ( - "is a semantic markup language for technical documentation. As a semantic language, DocBook enables its users to create document content in a presentation-neutral form that captures the logical structure of the content; that content can then be published in a variety of formats, including HTML, XHTML, EPUB, PDF, man pages and HTML Help, without requiring users to make any changes to the source.")
     useful docbook resources: - 
                               - ("Why SGML DocBook is dead"...)
DOI Digital Object Identifier (not "Identifier of a Digital Object") 
     by IDF (The International DOI Foundation), a non-profit organisation created in 1998.
     "DOI names are an implementation of the  => CNRI Handle System"
     DOI name: a digital identifier for any object of intellectual property. Application Profiles (APs)
     Kernel Declaration => Referent Metadata Declaration (RMD) for metadata interchange 
     implements object > AP (application profile) => [grid] service instances (Data Model and APIs:
     indecs/central Data Dictionary (iDD) is under development as the repository for all data elements and allowed values used in Kernel Metadata declarations and Referent Metadata Declarations (RMDs).
EAC-CPF Encoded Archival Context – Corporate bodies, Persons, and Families - Tag list:
     EAC-CPF is an XML standard, a communication structure for archival contextual information: encoding information about agents (individuals, corporate bodies and families). 
     It supports the exchange of ISAAR (CPF) compliant authority records. 
     It can be used in conjunction with Encoded Archival Description (EAD) for enhancement of EAD's capabilities in encoding finding aids, 
       but can also be used in conjunction with other standards or for standalone authority file encoding.
     EAC began with a 1998 effort by Richard Szary, Wendy Duff, and Daniel Pitti to envision a standard for encoding and exchanging authoritative information about the context of archival materials.
     ...founded at a meeting in Toronto, Canada, in March 2001; maintained by the SAA in partnership with the Berlin State Library
     EAC elements reflect the ISAAR(CPF) and the ISAD(G) standards managed by the ICA.
     "The EAC Working Group released the EAC-CPF schema and tag library on August 21, 2009"
     see also LEAF project (2001-4, for uploading distributed authorities (persons and corporate bodies) to a central system.
EAD Encoded Archival Description (from LOC) 
     1993-, version 1 was issued in 1998, ver. 2002 
EAG Encoded Archival Guide Document Type Definition
     It is created by the Spanish Ministry of Culture and it is largely applied in the following implementation:
ELAR The Endangered Languages Archive, drawing on OLAC terminology, requires a minimum of metadata describing each resource to be provided as part of your deposit.
ELRA info on Metadata and Catalogue
Eprints Application Profile - not a standard but a descriptive(?) profile within the DC framework
     see ePrints project
EUDICO annotation format (*.eaf)
EUPL European Union Public Licence -  v.1.1 (9 January 2009) -
     The EUPL is the first European Free/Open Source Software (F/OSS) Licence.
FAST Faceted Application of Subject Terminology (by LOC)
     is an adaptation of the LOC Subject Headings (LCSH) with a simplified syntax. -
     "...Currently [end of 2009] the FAST authority file contains over 1,600,000 authority records. 
     The FAST database is available at"
FRAD Functional Requirements for Authority Data (by IFLA's Working Group on FRANAR)
     20 June 2009 IFLA publication: 
     "Functional Requirements for Authority Data - A Conceptual Model". Edited by Glenn E. Patton, 
     München: K.G. Saur. (IFLA Series on Bibliographic Control; Nr. 34) ISBN 978-3-598-24282-3, Euro 59.95
FRBR Functional Requirements for Bibliographic Records (by IFLA, Final Report of 1997; München : K.G. Saur, 1998; 
     as amended and corrected through February 2009)
     see also: FRANAR => FRAD and, FRSAR => FRSAD
     This original is sometimes called FRBR-ER (in contrast to FRBRoo): an entity-relationship definition.
FRBRoo (object-oriented version, harmonised with CIDOC CRM) -
     ...a formal ontology intended to support the integration, mediation, and interchange of bibliographic and museum information (CIDOC approach).
     Version 1.0 (draft), May 2009.
FRANAR Func.Req. of Authority Numbering and Records => FRAD
FRSAD Functional Requirements for Subject Authority Data -is the new name for FRSAR
     see also FAST which is an adaptation of the LOC Subject Headings (LCSH) with a simplified syntax.
FRSAR Functional Requirements for Subject Authority Records => FRSAD
GAC Geographic Area Code (MARC code 043) -
GATE (General Architecture for Text Engineering) utilized at ANC (Am.Nat.Corpus) project
     see also GrAF (Graph Annotation Framework)
GOLD General Ontology for Linguistic Description
     linguistic ontology based on OLAC, being developed as part of the E-MELD project at the University of Arizona
     online tools etc.: SIL (Semantic Interpretation Language), FIELD (Field Input Environment for Linguistic Data)
GrAF (Graph Annotation Framework)(Ide & Suderman,2007) A Graph-based Format for Linguistic Annotations, ISO_TC_37
     (Ide-Suderman-2nnn-Bridging-the-gaps-interoperability-for-GrAF-GATE-and-UIMA.pdf: "GrAF has been developed by the International Standards Organization (ISO)'s TC37 SC4, as a part of the Linguistic Annotation Framework (International Standards Organization, 2008). GrAF provides an XML serialization of an abstract data model for annotations that is intended to serve as a 'pivot' for transducing among user-defined and tool input annotation formats. GrAF is intended to function in much the same way as an interlingua in machine translation: a common, abstract conceptual representation into and out of which user- and tool-specific formats are transduced, so that a transduction of any specific format into and out of GrAF accomplishes the transduction between it and any number of other GrAF-conformant formats. GrAF is currently an ISO Candidate Draft.")
     see also =>LAF 
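The "pivot" idea is easier to see with a toy stand-off annotation graph. This is only an illustration of the data model's spirit (regions over primary data, nodes carrying feature structures), not GrAF's actual XML serialization; all names and labels are invented:

```python
# Primary data stays untouched; annotations refer to it by offsets.
text = "Colorless green ideas"

# Regions are (start, end) character offsets into the primary text.
regions = {"r1": (0, 9), "r2": (10, 15), "r3": (16, 21)}

# Nodes reference regions and carry feature structures (here: PoS labels).
nodes = {
    "n1": {"region": "r1", "features": {"pos": "JJ"}},
    "n2": {"region": "r2", "features": {"pos": "JJ"}},
    "n3": {"region": "r3", "features": {"pos": "NNS"}},
}

def covered_text(node_id):
    """Return the slice of the primary text a node annotates."""
    start, end = regions[nodes[node_id]["region"]]
    return text[start:end]
```

Because every tool-specific format can be mapped onto such a graph, converting any one format into and out of the graph yields conversion between all of them.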
Handle System - an infrastructure (not end user or off-the-shelf software) developed by CNRI -
     1994- history: Handle System Overview -
     also => DOI: "The International DOI Foundation's implementation of handles, the DOI®  System, has over 40 million registered handles."
     Early adopters of the Handle System included the Library of Congress, the Defense Technical Information Center (DTIC), and the International DOI Foundation (IDF).
     "Handle System is a general-purpose global name service that allows secured name resolution and administration over networks such as the Internet. 
     The Handle System manages handles, which are unique names for digital objects and other Internet resources." -
     Every handle in the Handle System consists of two parts: its [NamingAuthority], followed by a unique [LocalName] under the naming authority, separated by "/". 
     - via RSPs (Resolution Service Provider) Example resolving "10.1045/july95-arms": 
        -> client first sends a query to the GHR (Global Handle Registry) for the "naming authority" (responsible for handles under the "10.1045"); 
        <- GHR (which registers and maintains "naming authority handles") returns the service information of the LHS (Local Handle Service);
        -> client now sends request to LHS (which is responsible for handles under the naming authority "10.1045")
         <- LHS responds with the result
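The first step of the resolution walk-through above — separating the naming authority from the local name at the first "/" — can be sketched as follows (a hypothetical helper, not part of any CNRI client library):

```python
def split_handle(handle):
    """Split a handle into (naming_authority, local_name) at the first '/'.
    E.g. the client queries the GHR with the naming authority, then asks
    the returned LHS to resolve the local name."""
    naming_authority, _, local_name = handle.partition("/")
    return naming_authority, local_name
```

For "10.1045/july95-arms" this yields the naming authority "10.1045" (sent to the Global Handle Registry) and the local name "july95-arms" (resolved by the Local Handle Service).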
IMDI (ISLE Metadata Initiative)
     element set (metadata framework) development began with ISLE, then maintained by IMDI-Team backed mainly by MPInst., Nijmegen  
     latest sets (session, cat, etc.) are 3.x, but no open doc, only schemas
     "IMDI set 3.0 was  designed as part of the INTERA and DOBES projects and is available as an  XML- Schema." -Broeder, D., & Offenga, F. (2004). Language Archive Newsletter, 1(2), 3-3.
     May 2000 - LREC 2000, Athens, Greece
          LREC (The International Conference on Language Resources and Evaluation) -
          "is organised by ELRA biennially with the support of institutions and organisations involved in Human Language Technologies (HLT)"
     Wittenburg, P., Broeder, D., and Sloman, B., (2000), EAGLES/ISLE: A Proposal for a Meta Description Standard for Language Resources, White Paper. LREC 2000 Workshop, Athens.
     2000 Nov.: ver 2.0
     Projects using IMDI (as 2007) * CGN - The Spoken Dutch Corpus Project * CNGT - Sign Language of the Netherlands * DAM-LR - Distributed Access Management for Language Resources
    * DBD - Dutch Bilingualism Database * DOBES - Documentation of Endangered Languages * ECHO Case Study 4 - Sign Languages
    * ECHO WP2 - Infrastructure and Technology * INTERA WP2 - Integrated European language data repository Area
IPA (International Phonetic Alphabet) a notational standard for the phonetic representation of all languages, 
     provided by International Phonetic Association (also =>IPA) -
     see also =>phonetic alphabet
ISAAR(CPF) - International Standard Archival Authority Record (for Corporate Bodies, Persons, and Families) - 2nd ed. 2004
     by ICA - - see also EAC-CPF
     from Archivists’ Toolkit: "...choosing the international standard of ISAAR (CPF) over the national standard of AACR/MARC will force some modification of MARC records in which a conference name is encoded as a corporate name."
ISAD(G) General International Standard Archival Description - 2nd ed. 1999
     defines the elements that should be included in an archival finding aid. see also ISAAR(CPF)
     by ICA - (
ISADN International Standard Authority Data Number(s) 
     a suggestion from IFLA -
     see VIAF system
ISBD International Standard Bibliographic Descriptions (by IFLA, 2007)
ISOcat DCR is the Data Category Registry for ISO TC 37. 
     Right now the DCR contains metadata elements from  the IMDI and part of the OLAC set.
ISO 639-2 three-letter identifiers for languages (bibliographic and terminology code sets; the two-letter identifiers are ISO 639-1)
ISO 639-3 is a code that aims to define three-letter identifiers for all known human languages.
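For illustration, a lookup over a tiny hand-picked subset of ISO 639-3 identifiers (the real code table has thousands of entries and is maintained by the ISO 639-3 registrar):

```python
# Illustrative subset of ISO 639-3 three-letter identifiers.
ISO_639_3 = {
    "eng": "English",
    "deu": "German",
    "tur": "Turkish",
    "nld": "Dutch",
}

def language_name(code):
    """Return the language name for an ISO 639-3 identifier (case-insensitive)."""
    return ISO_639_3.get(code.lower(), "unknown")
```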
ISO/TC 37-SC dataformats
     LAF: Linguistic Annotation Framework
     LMF: Lexical Markup Framework
     MAF: Morpho-Syntactic Annotation Framework
ISO/IEC 11179 Metadata Registry (MDR) standard
     "comes closest to addressing the semantic properties of documents, databases, and other resources, but does not establish a framework for representing rules relevant to terminology usage and conflict resolution among other issues."
     The reason that ISO/IEC 11179 does not seem to fill any real world commercial requirements for data exchange is probably that it is a purely theoretical construct, based on semantic theory. -wikipedia
kopal's UOF (Universal Object Format) - 2006 version:
     1 object = 1 zipped file - An archiving and exchange format for digital objects, based on the formats METS (Version 1.4) and LMER (Version 1.2), and has been developed within the project kopal. 
     " archive object consists of a packed file [pkzip-ed] that contains an arbitrary folder structure with an arbitrary number of files. 
     At the root level of the folder structure, there must be a file named "mets.xml" representing a valid XML file, according to the METS 1.4 schema.
     "The specifications of METS 1.4 state a number of sections and sub-sections for each METS file. Of those sections only METS Header, Descriptive Metadata (dmdSec), Administrative Metadata (amdSec) with the sub-sections Technical Metadata (techMD) and Digital Provenance Metadata (digiprovMD), as well as File Section and Structural Map are defined for the UOF..."
     "In the amdSec sections of METS in the UOF, LMER 1.2 is being used. That exploits the modular approach of LMER 1.2. Only the following LMER XML schemas are being used: lmer-object.xsd, lmer-file.xsd and lmer-process.xsd."
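The minimal structural requirement quoted above — a packed (zipped) file with "mets.xml" at the root of the folder structure — can be checked with a short sketch (this only tests for the file's presence; it does not validate against the METS 1.4 schema):

```python
import io
import zipfile

def looks_like_uof(data):
    """Check the minimal UOF requirement: a zip archive containing
    a file named 'mets.xml' at the root level."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return "mets.xml" in zf.namelist()

# Build a toy archive object in memory to demonstrate the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mets.xml", "<mets/>")          # placeholder, not valid METS
    zf.writestr("content/page1.tif", b"...")    # arbitrary payload files
```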
LAF Linguistic Annotation Format/Framework
     "As a test, LAF is being used to represent the American National Corpus..." also ISO_TC_37_SC4_N311
     Ide, N. and Suderman, K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. Proceedings of the Linguistic Annotation Workshop,held in conjunction with ACL 2007, Prague, June 28-29, 1-8.
     "...(Ide and Romary, 2004, 2006) developed within ISO TC37 SC41 and as such, implements state-of-the-art best practice guidelines for representing linguistic annotations."
     see also =>GrAF 
LCSH Library of Congress Subject Headings
     see also FAST (simplified version), FRSAD
LDC Catalogue metadata and rules
LIFT (Lexicon Interchange FormaT) is an XML format for storing lexicons/dictionaries, still evolving. 
     - - Also brief info at
     some applications using: WeSay, FLEx (FieldWorks Language Explorer) and Lexique Pro.
     see ../res-other-Standards/ Lexicon Schemas and Related Data Models-812_paper.pdf 
        (Thorsten Trippel, Michael Maxwell, Greville Corbett, Cambell Prince, Christopher Manning, Stephen Grimes and Steve Moran 2008;
     see different lexicon representation formats: DATR FSR LMF LIFT TEI CSV
     see SF (standard format -data format of most SIL tools such as Toolbox etc.), 
LMER Long-term preservation Metadata for Electronic Resources
     "There having been, at the time, no suitable standard for a metadata schema specifically for long-term archiving, 
     the German National Library introduced its own schema, based on a model at the National Library of New Zealand."
     ... seems only for preservation, no bibliographic etc. info
METS Metadata Encoding and Transmission Standard (by LOC)  
     is an XML schema that was developed as a standard data structure for describing complex digital library objects. 
     Depending on its use, a METS can be used in the role of Submission Information Package (SIP), 
     Archival Information Package (AIP), or Dissemination Information Package (DIP) within the Open Archival Information System [OAIS] Reference Model. 
     Sometimes METS records are exchanged between different centers using the OAI-PMH protocol (which is described below) [Tansley 2006]. 
METUbet -
     "METU has developed a new letter-to-phone conversion rule set that is based on the phonetic symbol set described in [4]. 
     "... However, SAMPA symbols have poor readability since they include characters such as numbers and punctuation symbols. This has led us to develop a new simplified alphabet called METUbet.  The choice of symbol formatting in METUbet is similar to that used within ARPAbet for American English."  
     "...[4] Ergenç İ., Konuşma Dili ve Türkçe'nin Söyleniş Sözlüğü, Simurg Yayınevi, Ankara, Turkey, 1995. "
     - Salor 2002.
MODS Metadata Object Description Schema -
     is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. 
     The standard is maintained by the  Network Development and MARC Standards Office of the Library of Congress
OAI- a nice place to begin is
   -ORE Object Reuse and Exchange (ver. 1.0 - 17 October 2008 -
     defines a data model for Resource Maps that describe aggregations of web resources, and recommends serialization formats for these Resource Maps. 
   -PMH Protocol for Metadata Harvesting (2002-06-14: Release of OAI-PMH version 2.0) 
     provides an XML message format for the exchange of XML records (typically metadata). It supports selective or incremental harvesting. 
     site =>
     protocol => (very detailed, see especially "Implementing",
     OAI and OAIS: What's in a Name? =>
          "I think it would be a tolerable oversimplification to distinguish the two efforts as: 
           * Open Archives Initiative: The focus is on "openness", through exposing and harvesting metadata through a simple, explicitly defined protocol. Note that metadata harvesting is the only model explicitly addressed 
           * Open Archival Information System: The focus is on "archival-ness" (apologies to William Safire, again) by thoroughly defining the framework, models, and terms needed to discuss long-term preservation of information. Note that protocols are not defined." 
           -Michael L. Nelson, NASA (
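A sketch of building OAI-PMH ListRecords request URLs, including the protocol rule that a resumptionToken is an exclusive argument (the endpoint URL in the test is hypothetical):

```python
from urllib.parse import urlencode

def listrecords_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.
    When a resumptionToken is supplied (incremental harvesting), the
    protocol forbids other arguments besides the verb, so only the
    token is sent; otherwise metadataPrefix is required."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)
```

A harvester would issue the first request with a metadataPrefix, then follow the resumptionToken returned in each response until none is left.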
OAIS Open Archival Information System reference model, 
     provides a framework for the understanding and increased awareness of archival concepts.
     a CCSDS/NASA recommendation (2002), later adopted as ISO 14721 -
     "The OAIS model is a well-known guideline for implementing digital archives." - CCSDS 650.0-B-1: Reference Model for an Open Archival Information System (OAIS). Blue Book (Standard). Issue 1. January 2002.  
OCLC Online Computer Library Center -
     1967, the presidents of the colleges and universities in the state of Ohio founded the Ohio College Library Center (OCLC)
     In 1977, changes in the governance structure enabled membership for libraries outside Ohio, and the organization became OCLC, Inc.
     In 1981, the legal name of the corporation became OCLC Online Computer Library Center, Inc.
     WorldCat—the largest database of bibliographic information in the world, PURL, ...
OLAC-MS Open Language Archives Community -metadata set - [2008-05-31 ]
     an extension of the DC set to the language resource domain, since Dec. 2000. (all documents: standards, extensions, vocabularies, recommendations, etc.)
OWL Web Ontology Language => see Glossary
PADI based on DC,
PREMIS PREservation Metadata Implementation Strategies -
     PREMIS Data Dictionary for Preservation Metadata (ver. 2.0):
PURL Persistent Uniform Resource Locator -
RDF Resource Description Framework - - W3C specifications originally designed as a metadata data model, then a standard for describing web resources.
     see more... glossary
SAMPA (Speech Assessment Methods Phonetic Alphabet) is a machine-readable phonetic alphabet. It was originally developed under the ESPRIT project 1541, SAM (Speech Assessment Methods) in 1987-89 by an international group of phoneticians, and was applied in the first instance to the European Communities languages Danish, Dutch, English, French, German, and Italian (by 1989); later to Norwegian and Swedish (by 1992); and subsequently to Greek, Portuguese, and Spanish  (1993). Under the BABEL project, it has now been extended to Bulgarian, Estonian, Hungarian, Polish, and Romanian  (1996). Under the aegis of COCOSDA  it is hoped to extend it to cover many other languages (and in principle all languages). 
     On the initiative of the OrienTel project, Arabic, Hebrew, and Turkish  have been added. Other recent additions: Cantonese, Croatian, Czech, Russian, Slovenian, Thai. Coming shortly: Japanese, Korean.
     see also =>phonetic alphabet
SKOS Simple Knowledge Organization System
SMIL - Synchronized Multimedia Integration Language (SMIL 2.1)
     W3C Recommendation 2008, see also TT and TTML -
TCF TextCorpus format (see WebLicht) 
     "The TCF format is a simple stand-off format for linguistic annotated textcorpora. In contrast to some other formats (for example PAULA or MAF), 
     TCF stores all linguistic layers in one file. That means that during the chaining process, the file grows. A typical chain might look like this: 
     Plain Text => Plain Text + Tokens => ... => Plain Text + Tokens + PoS + ParseTrees + Semantic
     TCF strives to be compatible with established standards, especially the data formats of ISO/TC 37-SC: 
        =>LAF: Linguistic Annotation Framework, =>LMF: Lexical Markup Framework, =>MAF: Morpho-Syntactic Annotation Framework 
     At the moment, converters are available for 
        =>PAULA, =>Negra, =>TüBa-D/Z
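The layer-growing chain described above can be mimicked with a toy container (a plain Python dict standing in for the single TCF file; the tags and tokenization are invented, and this is not the actual TCF XML schema):

```python
# One container holds all linguistic layers, as in a TCF file.
corpus = {"text": "Dogs bark loudly ."}

# Tokenizer step in the chain adds a token layer.
corpus["tokens"] = corpus["text"].split()

# PoS-tagging step adds a layer aligned with the tokens (tags invented).
corpus["pos"] = ["NNS", "VBP", "RB", "."]

# Each processing step only appends a layer; earlier layers stay untouched,
# so the container grows along the chain: text => +tokens => +pos => ...
```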
TEI Text Encoding Initiative [metadata set]
     A standard for the preparation and interchange of electronic texts. Earlier TEI guidelines were SGML-based; the current P5 guidelines are XML-based.
     The most recent guideline is P5 (TEI P5: Guidelines for Electronic Text Encoding and Interchange), 
     November 2007.
     also => corpus.standards
TigerXML - treebank/corpus encoding format (of =>TIGER project)
     .corpus -
TT, TTML (Timed Text Markup Language) – Distribution Format Exchange Profile (DFXP)
     W3C standard for captions, subtitle etc. see also SMIL -
     latest: 1.0 Candidate Recommendation 23 February 2010
     "...provides a standardized representation of a particular subset of textual information with which stylistic, layout, and timing semantics are associated by an author or an authoring system for the purpose of interchange and potential presentation."
     "...In addition to being used for interchange among legacy distribution content formats, TTML content may be used directly as a distribution format, providing, for example, a standard content format to reference from a   or  media object element in a [SMIL 2.1] document."
     Namespaces (Name - Prefix - Value)
     TT - tt: -, TT Parameter - ttp: -, TT Style - tts: -, TT Metadata - ttm: -
     (also TT Profile, Feature, Extension)
UNSPSC United Nations (sometimes, not officially, UNiversal) Standard Products and Services Code
     developed by the United Nations Development Programme (UNDP) and Dun & Bradstreet Corporation (D & B) in 1998.
UOF - see kopal's UOF
VIAF Virtual International Authority File - -
     (project with the LOC, the Deutsche Nationalbibliothek, the Bibliothèque nationale de France, and OCLC)
     The project's goal is to lower the cost and increase the utility of library authority files 
     by matching and linking the authority files of national libraries, 
     and then making that information available on the Web.
     ... initial research findings of the current VIAF system model show that 
     less than 10% of all records would be linked; this may be "good enough."
VRA Core 4.0 (2007) is a [meta]data standard for the cultural heritage community that was developed by the Visual Resources Association
WorldCat by OCLC, "the largest database of bibliographic information in the world."
XCES Corpus Encoding Standard for XML - - =>corpus, tagging
     XML Version of the =>CES, schema:
     see also "AMERICAN NATIONAL CORPUS FIRST RELEASE Encoding Conventions"
A p p l i c a t i o n s [top]
links: (link to comparison table etc.) ("Search or Browse for Repositories"...)
       !!! => Products in the field of Language Technology (from LT-World)
       !!! CLARIN Virtual Language Observatory - Tools:
ADD !!!! - "Personal web page of Michel Jacobson":
    * Interlinear Text Editor
    * SoundIndex
    * SoundApplet
    * Mediacut
    * Consoles
    * Textometre

*** repository & archive
- comparison chart
  comparative report
  Open Arch.Forum final reports (2002-3) (review of tech. issues etc.)
- Archimede (seems frozen?), CONTENTdm, EPrints(**), DSpace(***), Fedora(*), Zentity (Billy Gates), Mitos (Turkish), OPUS (Stuttgart Uni.)
- AT (Archivists Toolkit)
- MyCoRe, Archon
- VuFind (library resource portal)
- kolibri / kopal tools
- greenstone
- DAITSS (Dark Archive in the Sunshine State)

*** repository & archive related/tools
- Open Archives Harvester (from PKP)

*** tts
- eSpeak, Euler, Mbrola

*** Speech Recognition
- Sphinx/CMU Sphinx - Open Source Toolkit 
- Julius - large vocabulary continuous speech recognition (LVCSR) 

*** annotation, sound editing, caption-subtitle,... 
* links for annotation tools and formats: 
* Katharina Rohlfing 2006. Comparison of multimodal annotation tools. In: Gesprächsforschung, Ausgabe 2006 (7.Jahrgang) / Gesprächsforschung Online-Zeitschrift zur verbalen Interaktion Herausgegeben von Arnulf Deppermann und Martin Hartung ISSN 1617 - 1837 |	pdf
  compares Media and text editors, Anvil, ELAN, EXMARaLDA, TASX, MacVisTA.
- NCAM - National Center for Accessible Media
- AGTK Annotation Graph Toolkit
- Praat, Transcriber, Akustyk (free Praat plug-in), Anvil, 
- Audacity (recommended by the British Oral History Society)
* conversation analysis:
- FOLKER (ein AGD projekt) 

*** Taggers:
- SFST (Stuttgart Finite State Transducer tools) (Schmid,2005) - A Programming Language for Finite State Transducers -
  code branches: AFST, HFST, TRmorph
- RFTagger (Schmid-Laws-2008):
- TreeTagger (TC project, 1993-94): 
- Oslo-Bergen Tagger:
- TnT Tagger (Brants 2000): 
- Uplug toolkit [] for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment.
- Stockholm TreeAligner
- HunPos tagger, MaltParser (free tools)

*** Corpus tools:
CWB (Corpus Work Bench) Query system/engine -
Glossa – The Corpus Explorer, Version.0.9. Nygaard, Lars. 2007. -
     "...a web-based user interface for querying linguistic corpora. It is a front-end for the corpus query engine CWB." -
Zemberek - Zemberek 2 is an open source NLP library for Turkic languages -

*** etc. linguistic tools 
- FreeLing - a C++ library providing language analysis services (tokenization, PoS tagging, parsing, etc.; full entry below)
- SIL Tools: more than 60 pieces of software...
    * Shoebox, Toolbox, FieldWorks -A suite of software tools to help language teams manage language and cultural data, with support for complex scripts.- Language Explorer; 
    * Lexique Pro (
- WeSay enables a much larger group of people to build dictionaries of their own languages.

*** metadata tools

*** e-learning & Courseware Tools
- see links at UNESCO Free & Open Source Software Portal:

*** RDF
- RAP (rdf api...)

*** xml editors
- XMLCopyEditor (the best so far), XPontus, XMLFox (free edition is too simple), MEX

*** Other
- TiddlyWiki (portable diary?)
- Interactive Sagittal Section (nice JavaScript, phonetics...)

AGTK Annotation Graph Toolkit
     Project is based at the LDC (Linguistic Data Consortium) at the University of Pennsylvania.
     Applications (last updated 2007-03-31):
     # MultiTrans: transcribing multi-party conversation
     # TableTrans: observational coding of audio
     # TreeTrans: syntactic annotation
     # InterTrans: interlinear text transcription
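All four AGTK applications share the annotation graph model (Bird & Liberman 2001): an annotation is a typed, labeled arc between nodes that may carry time offsets. A minimal sketch of that data structure in Python; all class and field names here are illustrative, not the AGTK API:

```python
# Minimal annotation-graph sketch (after Bird & Liberman 2001):
# nodes carry optional time offsets, arcs carry typed labels.
# All names here are illustrative, not the AGTK API.

class AnnotationGraph:
    def __init__(self):
        self.nodes = {}   # node id -> time offset in seconds (or None)
        self.arcs = []    # (from_node, to_node, type, label)

    def add_node(self, nid, time=None):
        self.nodes[nid] = time

    def add_arc(self, src, dst, atype, label):
        self.arcs.append((src, dst, atype, label))

    def labels(self, atype):
        """All labels of one annotation type, in insertion order."""
        return [lab for (_, _, t, lab) in self.arcs if t == atype]

g = AnnotationGraph()
g.add_node(0, 0.00)
g.add_node(1, 0.31)
g.add_node(2, 0.59)
g.add_arc(0, 1, "word", "hello")     # word-level tier
g.add_arc(1, 2, "word", "world")
g.add_arc(0, 2, "speaker", "A")      # a second tier spans both words
print(g.labels("word"))              # ['hello', 'world']
```

Different tiers (words, speakers, syntax) become different arc types over one timeline, which is roughly what lets tools like MultiTrans and TreeTrans share a common file model.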
akustyk - free Praat plug-in by Bartlomiej Plichta -
Annotate - author: Oliver Plaehn, last updated: June 23, 2000
     is a tool for syntactic annotation of natural language corpora.  It provides an easy-to-use graphical user interface, a comprehensive set of commands for manipulation of syntactic structures, an interface to an external tagger and parser, a simple search function, an undo function, postscript/printer output, and many more features.
Anvil is a free video annotation tool -
     Some special features are cross-level links, non-temporal objects, timepoint tracks, coding agreement analysis and a project tool for managing whole corpora of annotation files.
     ...can import data from phonetic tools like PRAAT which allow precise and comfortable speech transcription. 
     Anvil can display waveform and pitch contour. Anvil's data files are XML-based. Exported tables can be used for analysis in statistical toolkits like SPSS. The coming version will also be able to import ELAN files. 
     ANVIL is written in Java
Archimede - A Canadian software solution for institutional repositories 
     Laval University Library, modeled after DSpace...
Archivists Toolkit -ver. 2.0 09/30/2009
     Java, Supported Database Backends: MySQL 5.0 (with the InnoDB storage engine), MS SQL Server 2005+, Oracle 10g
     FAQ: - Download:
     Name records in the Archivists’ Toolkit are designed to conform with the International Council on Archives’ ISAAR(CPF)
     Currently, the application supports accessioning and describing archival materials; establishing names and subjects associated with archival materials, including the names of donors; managing locations for the materials; and exporting EAD finding aids, MARCXML records, and METS, MODS and Dublin Core records. 
     Future functionality will be built to support repository user/resource use information, appraisal for archival materials, expressing and managing rights information, and interoperability with user authentication systems.
     The AT project is a collaboration of the University of California San Diego Libraries, the New York University Libraries and the Five Colleges, Inc. Libraries, and is generously funded by The Andrew W. Mellon Foundation.
Archon -archiving Web Tool Kit -
     a project at The University of Illinois at Urbana-Champaign 
Audacity a free, cross-platform Sound editor with many filters etc. (ver 1.2.6 -12.2009)
CONTENTdm Digital Collection Management Software
CWB (The IMS Open Corpus Workbench) Query system/engine 
     ...a collection of open-source  tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. 
     Its central component is the flexible and efficient query processor CQP. 
     the latest version of CWB is available at (first 1993-94 project: 
     see also Glossa, a web-based UI/front-end
DAITSS - Dark Archive in the Sunshine State 
     digital preservation repository system preserves digital content for the very long term.
     developed by the Florida Center for Library Automation. -
     Technical requirements: Java 5 or later / MySQL Version 4.x or later, with InnoDB data table support / The Sun Java Virtual Machine (JVM), version 5 or later. The JVM will be present if you have installed Sun's JRE, J2SE or J2EE systems.
    * A working Mail Transfer Agent (MTA) such as Sendmail or Postfix. 
DBpedia Ontology -is a shallow, cross-domain ontology, which has been manually created based on the most commonly used infoboxes within Wikipedia. 
     The ontology currently covers over 205 classes which form a subsumption hierarchy and has 1,210 properties.
DIAS-Core is based on standard IBM software components. Together with the kopal tools, it makes up the kopal solution.
     see kopal project
DSpace free repository (by HP Labs and MIT Libraries)
     Current Release 1.5.2, Oracle / PostgreSQL,
DoBeS Dokumentation Bedrohter Sprachen / Documentation of Endangered Languages
     Addit - a tool to 'plug' notes onto elements of the MPI archive.
     AMS - a tool to grant and deny access to (part of) a corpus.
     Annex/Trova - the ANNotation archive EXploration tool in the MPI web-based framework
     Arbil - offline (?) IMDI Metadata Editor, Browser & Organizer Tool
     Elan - Multimedia Annotator
     Imdi (ISLE Metadata Initiative) Editor - offline (MPInst. page says online?!) metadata editor for linguistic resources
       -needs Java 1.5 WebStart (javaws)
     Lamus - Language Archive Management and Upload System
     Lexus - a web-based lexicon tool
     Synpathy - a tool for manual syntactical annotation
     Vicos - a tool to complement lexical spaces (as created by LEXUS) with ontological spaces.
E-MELD online tools for OLAC repository, metadata etc.
EPrints - Digital Repository Software, Open Access and Institutional Repositories (by Uni. of Southampton)
     Latest Version: v3.1.3 (May 2009), Oracle/MySQL, Apache, ActivePerl - 
     see ePrints project too.
eSpeak - free TTS
EULER - TTS program using MBROLA engine
EXMARaLDA -  is an acronym of "Extensible Markup Language for Discourse Annotation". It is a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language. 
     ...developed in a project at the Collaborative Research Center "Multilingualism" (Sonderforschungsbereich "Mehrsprachigkeit" - SFB 538) at the University of Hamburg.
     EXMARaLDA suite includes Partitur Editor for transcription, 
                              Corpus Manager (CoMa) for managing corpus metadata and 
                              advanced corpus query tool EXAKT.
     The system's software tools - an editor for transcriptions in musical score notation, a corpus manager for administrating corpus meta-data, and a concordancing tool - are freely available to users outside the SFB. 
     The main features: XML based data formats - All EXMARaLDA transcriptions are stored in XML files 
                        Java based tools -
                        Interoperability - The EXMARaLDA concept is loosely based on the annotation graph framework (Bird/Liberman 2001) and thus aims at a maximal exchangeability and reusability of transcription data.
                        Hence... edit data also with other popular software (like Praat, ELAN  or the TASX Annotator).
Fedora - Flexible Extensible Digital Object Repository Architecture (not to be confused with the Linux distribution named Fedora)
     Cornell University, MySQL/PostgreSQL/MS-SQLServer
Festival TTS (speech synthesis) system, Uni. of Edinburgh (The Centre for Speech Technology Research)
     also CSLU Toolkit (from Oregon) modifies Festival, frozen by Dec.2004:
FOLKER (an AGD project - Archiv für Gesprochenes Deutsch) -
     To implement these new transcription conventions, Thomas Schmidt (University of Hamburg), the developer of EXMARaLDA, built the new FOLKER editor, with which FOLK is transcribed.
     conversation analysis (Gesprächsanalyse) - uses =>GAT 2 transcription system
FreeLing - package consists of a library providing language analysis services -
     written in C++, using standard STL templates, so it can be compiled in almost any platform.
     FreeLing is designed to be used as an external library from any application requiring this kind of services. Nevertheless, a simple main program is also provided as a basic interface to the library, which enables the user to analyze text files from the command line.
     Main services offered by FreeLing library:
    * Text tokenization * Sentence splitting * Morphological analysis * Suffix treatment, retokenization of clitic pronouns
    * Flexible multiword recognition * Contraction splitting * Probabilistic prediction of unknown word categories
    * Named entity detection * Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.)
    * PoS tagging * Chart-based shallow parsing * Named entity classification * WordNet based sense annotation and disambiguation
    * Rule-based dependency parsing * Nominal coreference resolution
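The first two services in the list (tokenization and sentence splitting) can be sketched naively in Python. The rules below are invented toy heuristics, not FreeLing's actual modules:

```python
import re

# Naive sketch of tokenization and sentence splitting, the first two
# services a FreeLing-style pipeline performs. Invented toy rules.

def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(tokens):
    """End a sentence after '.', '!' or '?' (no abbreviation handling)."""
    sents, cur = [], []
    for tok in tokens:
        cur.append(tok)
        if tok in {".", "!", "?"}:
            sents.append(cur)
            cur = []
    if cur:
        sents.append(cur)
    return sents

toks = tokenize("Dr. Smith arrived. It was late!")
# The naive rule wrongly ends a sentence after the abbreviation "Dr." -
# which is why real analyzers ship per-language abbreviation lists.
print(split_sentences(toks))
```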
FSTs (finite-state transducers)
     * OpenFst - is a library for constructing, combining, optimizing, and searching weighted FSTs. -
       contribution: OpenFst for Windows - OpenFst library and command line tools for Windows compiled with Visual Studio. 
     * SFST (Stuttgart Finite State Transducer tools) -
     a freely available finite-state tool set aimed particularly at implementing morphological analyzers
     SFST-PL: A Programming Language for Finite State Transducers, the programming language of the SFST tools
     Schmid 2005. In Proceedings of FSMNLP 2005, Helsinki, Finland.
     * Further developments e.g. code branches / forks: 
        AFST project of Anssi Yli-Jyrä - (with clear explanation); 
        HFST project led by Krister Lindén and Kimmo Koskenniemi, University of Helsinki; 
        TRmorph: A Turkish morphological analyzer by Çağrı Çöltekin (weaker than the others, and TrFST would be a better name...)
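What such a morphological transducer does can be sketched as a toy deterministic FST in Python. The tiny lexicon, tags, and function names below are invented for illustration and bear no relation to the SFST-PL formalism itself:

```python
# Toy deterministic FST sketch: transitions map (state, input char) to
# (output string, next state); accepting states append a final output.
# Lexicon and tags are invented, unrelated to SFST-PL.

def run_fst(transitions, final_out, word):
    state, out = 0, []
    for ch in word:
        key = (state, ch)
        if key not in transitions:
            return None                      # no path: not in the lexicon
        emit, state = transitions[key]
        out.append(emit)
    if state not in final_out:
        return None                          # stopped in a non-final state
    return "".join(out) + final_out[state]

# Analyzer fragment for "cat"/"cats":
transitions = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "s"): ("", 4),                       # plural 's' emits nothing here
}
final_out = {3: "<N><sg>", 4: "<N><pl>"}     # analysis emitted at the end

print(run_fst(transitions, final_out, "cats"))  # cat<N><pl>
print(run_fst(transitions, final_out, "cat"))   # cat<N><sg>
```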
Glossa – The Corpus Explorer, Version.0.9. Nygaard, Lars. 2007. -
     "...a web-based user interface for querying linguistic corpora. It is a front-end for the corpus query engine CWB." -
     => CWB 
greenstone - -
     "...Greenstone is a complete digital library creation, management, and distribution package created and distributed by the New Zealand Digital Library Project."
     "...a suite of software for building and distributing digital library collections. Produced by the New Zealand Digital Library Project at the University of Waikato
     "...Java2 Runtime Environment as a pre-requisite and GSDL 2.72 (Windows version). You may install the ImageMagick and Ghostscript software, required to build collections and to do advanced conversion of PDF and Postscript documents."
     serve and harvest OAI-PMH, METS and DSpace import/export, multilingual, etc.
     more info:
HunPos - (former
     1.0 (2007) is an open source reimplementation of TnT (the well known part-of-speech tagger by Thorsten Brants)
     implemented in OCaml, a high-level language, which has a high-performance compiler that produces native code with speed comparable to C/C++ implementations. 
Interactive Sagittal Section -
     Displays sagittal sections and IPA transcriptions for user-specified lip and tongue positions, using JavaScript. 
Julius -
     is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. Based on word N-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in 60k word dictation task. Major search techniques are fully incorporated such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also modularized carefully to be independent from model structures, and various HMM types are supported such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to cope with other free modeling toolkit such as HTK, CMU-Cam SLM toolkit, etc.
     Note: you should prepare a language model and an acoustic model  to run a speech recognition with Julius
     The main platform is Linux and other Unix workstations, and also works on Windows. Most recent version is developed on Linux and Windows (cygwin / mingw), and also has Microsoft SAPI version. Julius is distributed with open license together with source codes.
     Julius has been developed as a research software for Japanese LVCSR since 1997, and the work was continued under IPA Japanese dictation toolkit project (1997-2000)
Kay Pentax -
     Turkish representative: Erişçi - Hürol Erişçi; biomedical specialist: Engin Bey - İskete sok. 17/3 Firuz Apt. Şişli - 230 87 06, 231 06 77
     Multi Speech 3700 - analyzer and annotator, IPA phonetic transcription (see Mine Kılıç, K.Maraş Merkez Ağzı)
     CSL Computerized Speech Lab. (latest Model 4500)
     MDVP Multi Dimensional Voice Program
koLibRI - kopal Library for Retrieval and Ingest
     The kopal Library for Retrieval and Ingest (koLibRI) represents 
     a library of Java tools that have been developed for the interaction with the DIAS system of IBM within the kopal project.
MaltParser - (former
     "MaltParser 1.0 (2007) - 1.3+ (2009) is a complete reimplementation of MaltParser in Java" - pre-trained models: Swedish, English
     ...developed by Johan Hall, Jens Nilsson and Joakim Nivre at Växjö University and Uppsala University, Sweden.
     ...a system for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model."
     Malt-XML (Malt-XML is an XML-based representation format for dependency treebanks.), Malt-TAB and MaltConverter (to/from TIGER-XML)
     Version 0.4 ...implemented in C, installs for Solaris, Linux, Windows (requires Cygwin), Mac OS X
     ...pretrained parsers using support vector machines for three languages: Swedish, English, Chinese 
MBROLA - diphone-based speech synthesis engine (see also =>EULER)
     1996, Alain Ruelle, TCTS Lab - Faculté Polytechnique de Mons 
     1) MANDATORY: The kernel, composed of the MBROLA binary and voices. The zip files you should copy here depend on your operating system and computer. The binary itself takes less than 100 KB, and a voice/language about 5 MB.
     2) OPTIONAL: The previous binaries can synthesize speech by themselves, or be combined with one of the following:
     * FULL TTS SYSTEMS FOR MBROLA allowing Text To Speech conversion with Mbrola
     * MBROLIGN, a prosody transplantation tool for Windows. Make it sing, or copy your voice for natural prompts...
     * XLANG, a cross-language adaptation filter that lets MBROLA speak with foreign accents, using pho files for one language and speaking them with another language database. It is provided by Anne Warlus as a Perl script under the GNU GPL license. See also Mike Hamilton's cross-language synthesis experiments.
     * EmoFilt, a prosody filter that adds emotion to your .pho files, provided both with source codes and binary by Felix Burkhardt.
     * Some sound utilities and script (11Kb)
MEX - xml editor-for-EAD-etc, java (Eclipse Product)
MiTOS Open Access System (Açık Erişim Sistemi)
     developed in cooperation between the library-software company MikroBeta ( and ANKOS; OAI conformant
MyCoRe (My Content Repository) an opensource framework for storage and retrieval of multimedia objects -
     System requirements: a Java servlet application which runs in any servlet container like Apache Tomcat (recommended), Jetty or IBM WebSphere
     / Web Application Server: Apache Tomcat / Java 6 SDK, Apache Ant 
     / Database: MySQL, Oracle, IBM DB2, HSQLDB (is in MyCoRe included) or other relational database management system
     There are some portal applications (
     Demo application: DocPortal - URL: 
NCAM - National Center for Accessible Media -
    caption-subtitle, accessibility issues on web etc.
    * MAGpie, free software for adding captions and video descriptions to QuickTime, Windows Media, Real and Flash multimedia
       MAGpie is a Java2 application which runs on Windows 9x/NT/2000/XP/Vista/Windows 7. 
       It requires the installation of several applications, and it is critical that you install them in the order indicated
    * CCforFlash, a free Flash component that can be used to display captions of Flash video and audio content
    * ccPlayer and ccMP3Player, free players that incorporate CCforFlash components; useful for non-Flash authors who want to add captions to Flash video or audio, respectively
    * CaptionKeeper, software that converts television closed-caption data into Web streaming formats
    * NCAM QA Favelet, a tool to help developers identify accessibility problems on Web pages
    * STEP, the Simple Tool for Error Prioritization for Section 508 compliance
Open Archives Harvester - PKP OAI Harvester
     allows you to create a searchable index of the metadata from Open Archives Initiative (OAI)-compliant archives, such as sites using Open Journal Systems (OJS) or Open Conference Systems (OCS).
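A harvester of this kind boils down to issuing OAI-PMH requests and walking the returned XML. A minimal Python sketch (the base URL is a placeholder, and a real harvester must also follow resumptionTokens across large result sets):

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Sketch of one OAI-PMH ListRecords round trip. Base URL below is a
# placeholder; resumptionToken handling is omitted.

DC = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the ListRecords request URL for an OAI-PMH endpoint."""
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})

def titles(oai_xml):
    """Pull all dc:title values out of a ListRecords response."""
    root = ET.fromstring(oai_xml)
    return [el.text for el in root.iter(DC + "title")]

# with urlopen(list_records_url("https://example.org/oai")) as resp:
#     print(titles(resp.read()))
```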
OPUS open source "text" repository -
     The project ended in October 1998; further development of OPUS is now funded by the University of Stuttgart. 
     There is a development co-operation with the Library Service Center for Baden-Württemberg (BSZ), which hosts the OPUS development platform. OPUS Version 3.0 has been deployed at Stuttgart University since February 2006. 
Oslo-Bergen Tagger - Oslo Uni, Norway -
     online tagger:
PRAAT phonetics tool, free
     Speech analysis, Speech synthesis, Listening experiments, Labelling and segmentation, and more...
     a set of scripts written for F0 -velocity computation, segmentation, amplitude normalization, etc.
RAP - RDF API for PHP V0.9.6 -
     RAP is a software package for parsing, querying, manipulating, serializing and serving RDF models. 
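An RDF model is, at bottom, a set of (subject, predicate, object) triples. A minimal Python sketch of parsing the N-Triples serialization into such a model; the toy regex handles only IRIs and plain literals, nothing like RAP's full parser family:

```python
import re

# Minimal RDF model sketch: parse N-Triples lines into a list of
# (subject, predicate, object) triples. Toy regex: IRIs and plain
# literals only (no blank nodes, datatypes, or language tags).

TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.')

def parse_ntriples(text):
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                       # skip blanks and comments
        m = TRIPLE.match(line)
        if m:
            s, p, o_iri, o_lit = m.groups()
            triples.append((s, p, o_iri if o_iri is not None else o_lit))
    return triples

data = '<http://ex.org/a> <http://purl.org/dc/elements/1.1/title> "ANC" .'
print(parse_ntriples(data))
# [('http://ex.org/a', 'http://purl.org/dc/elements/1.1/title', 'ANC')]
```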
RFTagger -
     ...Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging
     refer as: Schmid and Laws 2008
     "The main differences of our tagger to a standard trigram tagger are that the order of the Markov model (the k in equation 1) is not fixed and that the context probability p(t_i | t_{i-1} ... t_{i-k}) is internally computed as a product of attribute probabilities. In order to increase the speed, the tagger also applies a beam-search strategy which prunes all search paths..."
SIL Speech Analyzer, Phonology Assistant, etc.
SEAL (System of Exhibition and Analysis of Linguistic Data), developed and published by Chitsuko and Yusuke Fukushima first in 1983, 
     works on a personal computer and can be used to process and analyze geolinguistic data and produce linguistic maps. 
     Windows, Visual Basic. Many bugs; the newest (2004) version SEAL 7.0J works only on Japanese Windows computers.
SesTek (and GVZ) 
     Prof. M. Levent Arslan, Asst. Prof. Murat Saraçlar
     İTÜ Ayazağa Teknokent, Maslak 286 25 45
SFST (Stuttgart Finite State Transducer tools) -
Sphinx (or CMU Sphinx) - Open Source Toolkit For Speech Recognition
     workshop will be on March 13...
     1:45-2:10: A Sphinx Based Speech-Music Segmentation Front-End For Improving The Performance Of An Automatic Speech Recognition System In Turkish 
     Cemil Demir, TUBITAK-UEKAE; Erdem Ünal, TUBITAK-UEKAE; Mehmet Ugur Dogan, TUBITAK-UEKAE
Stockholm TreeAligner - A tool for aligning and searching parallel treebanks -
Tagger Chunker =>TreeTagger
TnT (Trigrams'n'Tags) Tagger - - (Brants 2000).
     Thorsten Brants' Statistical Part-of-Speech Tagging, Last changed: 26 Oct 1998!!!
     "TnT comes with two language models, one for German, and one for English. The German model is trained on the Saarbrücker German newspaper corpus using the (=>tagset) Stuttgart-Tübingen-Tagset The English model is trained on the Susanne Corpus. Additionally, there is a pre-compiled model trained on the Penn Treebank."
     command line oriented, developed under Solaris in Ansi-C using GNU C compiler
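The trigram-HMM idea behind TnT can be sketched in Python. This toy uses bigram transitions, a flat floor probability for unseen events, and an invented two-sentence corpus; real TnT adds trigrams, linear-interpolation smoothing, and suffix-based guessing for unknown words:

```python
from collections import defaultdict

# Toy HMM tagger sketch in the spirit of TnT (bigram, not trigram).
# Corpus and tagset are invented for illustration.

def train(tagged_sents):
    trans_c = defaultdict(lambda: defaultdict(int))  # prev tag -> tag
    emit_c = defaultdict(lambda: defaultdict(int))   # tag -> word
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans_c[prev][tag] += 1
            emit_c[tag][word] += 1
            prev = tag
    norm = lambda d: {k: {x: c / sum(v.values()) for x, c in v.items()}
                      for k, v in d.items()}
    return norm(trans_c), norm(emit_c)

def viterbi(words, trans, emit, floor=1e-6):
    best = {"<s>": (1.0, [])}                 # state -> (prob, tag path)
    for w in words:
        nxt = {}
        for t in emit:
            e = emit[t].get(w, floor)         # floor for unseen words
            nxt[t] = max(
                ((p * trans.get(pt, {}).get(t, floor) * e, path + [t])
                 for pt, (p, path) in best.items()),
                key=lambda x: x[0])
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]

corpus = [
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("barks", "V")],
]
trans, emit = train(corpus)
print(viterbi(["the", "dog", "sleeps"], trans, emit))  # ['DET', 'N', 'V']
```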
Transcriber - a free tool for segmenting, labeling and transcribing speech or
TreeTagger -
     a tool by Helmut Schmid for annotating text with part-of-speech and lemma information, in the TC (Textcorpora und Erschliessungswerkzeuge) project (1993-94) at the Institute for Computational Linguistics of the University of Stuttgart.
     The TreeTagger can also be used as a chunker for English, German, and French.
     "a Markov Model tagger which makes use of a decision tree to get more reliable estimates for contextual parameters...
     ...then with "Improvements in Part-of-Speech Tagging with an Application to German" (pdf-
     ...also Tagger Chunker, an annotation tool for nominal and verbal chunks based on the TreeTagger with parameter files for German and English -
TRmorph =>FSTs
Uplug toolkit -
     perl / ...for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment.
     "Several tools have been integrated in Uplug. Pre-processing tools include a sentence splitter, tokenizer and wrappers around external part-of-speech tagger and shallow parsers. 
     The following external tools are included in the standard package: The Grok system for English (tagging and chunking), and the morphological analyzer ChaSen for Japanese.
     Translated documents can be sentence aligned using the length-based approach by Gale&Church, hunalign or GMA by Melamed and others.
     Words and phrases can be aligned using the clue alignment approach and GIZA++ (a toolbox for training statistical alignment models for SMT).
     Other tools can easily be integrated, for example, The TreeTagger for English, French, Italian, and German, the TnT tagger for English, German and Swedish.
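The length-based sentence alignment of Gale & Church mentioned above can be sketched as dynamic programming over sentence lengths. This simplified version scores 1-1 beads by absolute character-length difference and allows only 1-1/1-0/0-1 beads; Gale & Church proper use a Gaussian cost model and also allow 2-1/1-2 merges:

```python
# Simplified length-based sentence alignment sketch (cf. Gale & Church
# 1993). Inputs are character lengths of source/target sentences; the
# skip penalty value is an arbitrary illustrative choice.

def align(src_lens, tgt_lens, skip_penalty=10):
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:            # 1-1 bead
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "1-1")
            if i < n:                      # 1-0 bead (source deletion)
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "1-0")
            if j < m:                      # 0-1 bead (target insertion)
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "0-1")
    beads, i, j = [], n, m                 # backtrack the cheapest path
    while (i, j) != (0, 0):
        pi, pj, bead = back[i][j]
        beads.append(bead)
        i, j = pi, pj
    return beads[::-1]

print(align([20, 35, 10], [22, 33, 11]))   # ['1-1', '1-1', '1-1']
```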
VuFind library resource portal, with GPL open source license, marginal. (15 Oct 2008: 1.0 Release Candidate 1)
     requirements: Apache HTTP Server, PHP, MySQL, Java J2SE
Zemberek -
     Zemberek is an open source, platform independent, general purpose Natural Language Processing library and toolset designed for Turkic languages, especially Turkish. 
     Zemberek is officially used as spell checker in Open Office Turkish version and Turkish national Linux Distribution Pardus.
     java - "Zemberek has a root-list-based structure; that is, for word analysis to complete successfully, the word roots of the language must be known."
     "TRmorph... uses a lexicon based on the heavily modified wordlist of Zemberek spell checker."
Zentity v1.0 - MS free Repository Platform 
     SQL Server 2008 and .NET Framework
XMLCopyEditor - by gnschmidt at users dot sourceforge dot net
XMLFox - by Rustemsoft -
     (they also have XML Converter, convert XML to Excel, CSV, HTML, txt, Access and some other conversions, but NOT FREE),
     free edition is too simple
XPontus - by Yves Zoundi, GPL License,
      java, too simple

E q u i p m e n t [top]
* Portable Digital Audio Recorder comparison charts
* Solid State Memory Card Recorders:
CompactFlash (CF) > Secure Digital (SD) and Secure Digital High Capacity (SDHC)
Common comparisons of memory card recorders: Edirol R-09HR, Marantz PMD-620, -660, -661, -670, -671, M-Audio, Sony PCM-D50, TASCAM DR-1, -07, -100, Zoom H2, Zoom H4n (see also FR-2LE field recorder: Zuhal Muzik Tic San Ltd, Galipdede Cad. No.37, Tunel-Beyoglu 80090, Istanbul, Turkey 212-249-8510 ??)
Hard Disc and "Hybrid" Recorders: Korg MR-1, Marantz CDR420

Abbreviations - T e r m i n o l o g y - G l o s s a r y - other acronyms [top]
see (and include) also 
     - "Premis-PREservation-Metadata-data-dictionary-2-0-from-LOC.pdf" (local)
     - "Glossary... National Archives of Australia (further down this document)"
     - list of "Acronyms and expressions used on"
       and the other links there, like "Terminology", 
       "Jargon", etc.
     - "JISC Preservation of Web Resources (PoWR) Handbook" ( glossary
     - DCMI Glossary -
     - linguistic terminology: see GOLD (General Ontology for Linguistic Description) website
     - EAGLES:
annotation / ? notlama (see also =>speech recognition)
     Geoffrey Leech 2004 Chp.2 - Adding Linguistic Annotation.
     POS tagging: one common type of annotation is the addition of tags, or labels, indicating the word class to which words in a text belong. This is so-called part-of-speech tagging (or POS tagging)
     phonetic annotation e.g. adding information about how a word in a spoken corpus was pronounced.
     prosodic annotation — again in a spoken corpus — adding information about prosodic features such as stress, intonation and pauses.
     syntactic annotation e.g. adding information about how a given sentence is parsed, in terms of syntactic analysis into such units as phrases and clauses
     semantic annotation e.g. adding information about the semantic category of words — the noun cricket as a term for a sport and as a term for an insect belong to different semantic categories, although there is no difference in spelling or pronunciation.
     pragmatic annotation e.g. adding information about the kinds of speech act (or dialogue act) that occur in a spoken dialogue — thus the utterance okay on different occasions may be an acknowledgement, a request for feedback, an acceptance, or a pragmatic marker initiating a new phase of discussion.
     discourse annotation e.g. adding information about anaphoric links in a text, for example connecting the pronoun them and its antecedent the horses in: I'll saddle the horses and bring them round. [an example from the Brown corpus]
     stylistic annotation e.g. adding information about speech and thought presentation (direct speech, indirect speech, free indirect thought, etc.)
     lexical annotation: adding the identity of the lemma of each word form in a text — i.e. the base form of the word, such as would occur as its headword in a dictionary (e.g. lying has the lemma LIE).
     see also =>transcription for standards, =>tagging
     =>TigerXML, XCES, EAGLES Corpus Encoding Standard (CES), LAF
API Application Programming Interface
archive ...
     see also digital library, repository
articulation boğumlama; net telaffuz
authorities, authority data, authority file, authority record
     see standards: EAD from LOC, ISAAR(CPF) of ICA, FRANAR of IFLA, EAC-CPF
     and LEAF project (to develop a model architecture for a system that uploads distributed authorities (persons and corporate bodies) to a central system and automatically links those authorities that belong to the same entity. Local authority data will be uploaded from the local servers of the participating organisations to the central LEAF system where it is stored in the EAC (Encoded Archival Context) format.)
ASR Automatic Speech Recognition 
     =>Speech Recognition
BC / Browsable Corpus 
     a concept which is aimed to help the researcher to navigate in the universe of corpora at the MPI and eventually even in a global universe. 
     XML is used for the meta descriptions.
BWF Broadcast Wave Format
     standard by EBU (EBU Tech 3285), an extension of .wav and recommended by IASA
     metadata can be incorporated into the headers...
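Reading that header metadata amounts to walking the RIFF chunks of the .wav container until the 'bext' chunk appears. A Python sketch decoding just the first two bext fields (Description, Originator), with field offsets per EBU Tech 3285:

```python
import struct

# Sketch of locating the BWF 'bext' chunk inside a RIFF/WAVE file and
# decoding its first two text fields (EBU Tech 3285: Description is
# 256 bytes, Originator the next 32, both null-padded ASCII).

def find_chunk(data, wanted):
    """Walk RIFF chunks and return the payload of `wanted`, or None."""
    assert data[:4] == b"RIFF" and data[8:12] == b"WAVE"
    pos = 12
    while pos + 8 <= len(data):
        cid = data[pos:pos + 4]
        size = struct.unpack("<I", data[pos + 4:pos + 8])[0]
        if cid == wanted:
            return data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)   # chunks are word-aligned
    return None

def bext_fields(data):
    bext = find_chunk(data, b"bext")
    if bext is None:
        return None                    # plain .wav without BWF metadata
    return {
        "description": bext[:256].rstrip(b"\x00").decode("ascii"),
        "originator": bext[256:288].rstrip(b"\x00").decode("ascii"),
    }
```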
CG conceptual graph
CHAT a =>transcription system
CMS Content-or-Course Management System
CNAF Central Name Authority File
Collections - Archives, libraries and museums all view the items they hold as ‘collections’. However, the criteria they use to define ‘collection’ vary, resulting in different approaches to recording descriptions of collections and the individual items within them.
     For archivists, the individual item is an integral part of a group of items that forms the record of an individual or organisation, and the description of such a grouping is a fundamental (and standardised) part of archival practice. To an archivist, a 'fonds' collection description denotes a collection that has been accumulated by an individual, business or organisation in the course of their life or work. This may be distinguished from an artificial collection, which does not originate from the same source, but consists of archives that have been brought together and described as a whole. They may, for example, be on the same subject, relate to the same geographical location or be of the same form (e.g. photographs). Archivists have developed their own metadata schema for archival documentation.
     Libraries have focused on description of individual items – the individual records in a catalogue – but have always used the concept of ‘collection’ in their management of these resources. Collections are defined by various criteria – e.g. location, subject, form, use type, bequests by individuals, audience – but where descriptions were provided these tended to be more informal and less structured. Descriptions at collection level were most often found in directories of libraries with collections on specific subjects. Many of the collections held by libraries do not have a fixed set of items – new titles will be added and other titles withdrawn over time – but other collections, for example those donated by a benefactor, do not change in this way.
     Museums also use the concept of ‘collection’ using various criteria – e.g. form or type of object, subject, objects donated by an individual benefactor – to describe and manage these groupings. Traditionally, museums have recorded inventory type documentation, including information on provenance and value, but did not make this publicly available at the point of display. It should be noted that although a small museum might display all the items it holds, larger museums display only a proportion of their holdings and store the remainder. The emergence of digital technology, resulting in the creation of surrogate images of items, has changed requirements with regard to both item level and collection level description.
     The practice of creating ‘collection level’ descriptions is not new to the different curatorial traditions, but there is an increasing need to improve the effectiveness of digital resource discovery techniques, especially where users want to search across the distributed holdings, both digital and physical, of several institutions. It is in this context that ‘collection level description’ across curatorial traditions has developed.
     - UKOLN (
continuum / geographical dialect continuum / dialect continua ? süreklilik
controlled vocabulary - rules or guidelines or lists of words for "value"s
     lists: "authority lists" or "authority files". These are sources of proper nouns (e.g. people, organisations, places), their "authority headings" and references.
     Library of Congress Authorities -
     International Standard Archival Authority Record for Corporate Bodies, Persons, and Families - ISAAR(CPF) -
     UN Standard Products and Services Code (UNSPSC)
     also => vocabularies
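     In machine-readable form, a controlled vocabulary is at its simplest a closed list of permitted values plus a validation step. A minimal sketch (the codes below are invented for illustration, in the style of ISO 639-3 language codes):

```python
# A controlled vocabulary as a closed set of permitted values for a
# metadata field; free text outside the list is rejected.
# The vocabulary contents here are illustrative, not authoritative.

LANGUAGE_VOCAB = {"tur", "deu", "eng"}   # hypothetical authority list

def validate(value, vocab=LANGUAGE_VOCAB):
    """Return the value if it is in the vocabulary, else raise."""
    if value not in vocab:
        raise ValueError(f"{value!r} is not in the controlled vocabulary")
    return value

validate("tur")          # accepted
# validate("turkish")    # would raise ValueError: free text is rejected
```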
Corpus (plural corpora) / derlem
     a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
     A corpus is a large, usually computerized, database of spoken and/or written texts of a language, which allows for “searching for, retrieving, sorting and calculating linguistic data” (McEnery & Wilson 1996: 17).
     = A corpus (Turkish: derlem) is a large database in which various spoken or written texts of a language are brought together and usually processed with the help of a computer.
     ...? a large and structured set of texts (usually electronically stored and processed) 
     also => treebank (or parsed corpus) / also => standards > corpus.standards
crosswalk see metadata mappings
CSV comma separated value
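     CSV looks trivial, but values may themselves contain commas or quotes, so a naive line.split(",") is unreliable; a CSV parser handles the quoting rules. A quick Python illustration:

```python
# Parsing CSV with proper quote handling: the second field of the data
# row contains a comma, which csv.reader keeps inside one value.
import csv
import io

data = io.StringIO('id,title\n1,"Jefferson, G. (2004)"\n')
rows = list(csv.reader(data))
# rows == [["id", "title"], ["1", "Jefferson, G. (2004)"]]
```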
DIDA (DIskursDAtenverarbeitung) a =>transcription system
digital library ...
     see also archive, repository
DMS Document Management System
DNS Domain Name System
domain model -sometimes used as a synonym for ontology
donator [speaker / informant] ? sağlayıcı / konuşmacı / bilgilendirici
ECMAScript - is an object-oriented programming language for performing computations and manipulating computational objects within a host environment. ECMAScript as defined here is not intended to be computationally self-sufficient.
     The Ecma Standard (1st ed. 1997) is based on several originating technologies, the most well known being JavaScript (Netscape) and JScript (Microsoft). The language was invented by Brendan Eich at Netscape and first appeared in that company’s Navigator 2.0 browser.
     3rd ed. of the ECMAScript standard was adopted by the Ecma General Assembly of December 1999 and published as ISO/IEC 16262:2002 in June 2002. 
     ECMA-262 5th ed. 2009.
field data (collected) 
     perceptual (algısal) judgements, articulatory data, simple recordings
GAT (Gesprächsanalytisches Transkriptionssystem, neu GAT 2) a =>transcription system [for conversation and discourse analysis, a version of Jefferson-style]
     "Gesprächsanalytisches Transkriptionssystem 2 (GAT 2)" Gesprächsforschung - Online-Zeitschrift zur verbalen Interaktion (ISSN 1617-1837) Ausgabe 10 (2009), Seite 353-402 (
     English abstract : "This article presents a revised version of GAT, a transcription system first developed by a group of German conversation analysts and interactional linguists in 1998. GAT tries to follow as many principles and conventions of CA's Jefferson-style transcription as possible, yet proposes some conventions which are more compatible with linguistic and phonetic analyses of spoken language, especially for the representation of prosody in talk-in-interaction. After ten years of use by many researchers in conversation and discourse analysis, it was time to revise the first version, against the background of past experience and in light of new necessities for the transcription of corpora arising from technological advances and methodological developments over recent years. This text presents the new GAT 2 transcription system with all its conventions. It gives detailed instructions on how to transcribe spoken talk on three levels of delicacy: minimal, basic and refined transcript versions. In addition, it briefly introduces a few tools that may be helpful for the user: the online tutorial GAT-TO and the transcription editing software FOLKER." 
GPL The GNU General Public License 
     the most popular and well-known free software license of the type of strong copyleft license that requires derived works to be available under the same copyleft.
granularity, granulation - The level of detail at which an information object or resource is viewed or described.
grid- a specific computing infrastructure designed to aggregate a set of resources.
     -service has potentially transient/fani service-factory instances (opposed to non-transient and persistent/daimi web service)
     types of sharing resources: Computational Grids, Data Grids, Storage Grids, Equipment Grids
     types by distribution: Internet, Virtual Organisation (VO, most grids fall into this category e.g. TeraGrid, QCDGrid), Local
     The term Grid to denote a distributed computing and storage environment was coined in 1998 by Ian Foster and Carl Kesselman (Globus Toolkit). 
     Grid computing has roots in volunteer computing, pioneered by distributed.net in 1997 and later by SETI@home in 1999, to harness the power of networked PCs...
     Gridipedia (, an article (, GridCafe (
HIAT a =>transcription system
HLT Human Language Technologies
HMM - Hidden Markov model 
     "Modern general-purpose speech recognition systems are generally based on Hidden Markov Models...
     "speech could be approximated as a stationary process. Speech could thus be thought of as a Markov model for many stochastic processes.
     "Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use.
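     The decoding step such recognizers rely on can be illustrated with the classic Viterbi algorithm: given an observation sequence, find the most probable hidden-state sequence under the model. A toy sketch (all probabilities below are invented for illustration, not from any real recognizer):

```python
# Toy Viterbi decoder over a tiny hand-made HMM: two hidden states
# ("Rainy"/"Sunny") emit observable activities; we recover the most
# probable state path for an observed activity sequence.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of best path ending in state s at time t, path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                 V[t - 1][prev][1] + [s])
                for prev in states
            )
            V[t][s] = (prob, path)
    return max(V[-1].values())   # (probability, best state path)

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

prob, path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
# path == ["Sunny", "Rainy", "Rainy"]
```

     In a speech recognizer the hidden states would be phones or sub-phone units and the observations acoustic feature vectors, but the decoding idea is the same.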
HTML Hyper Text Markup Language
i18n internationalization
l10n localisation
IR Institutional Repository
instance örnek; an object derived from a class
ISBN International Standard Book Number
Java ...
JavaScript => ECMAScript
Jefferson-style =>transcription, by Gail Jefferson (1938-2008)
     Jonathan Potter, Professor of Discourse Analysis, Loughborough University - 
     "The principal system of transcription used by conversation analysis and discursive psychology was developed by Gail Jefferson.  It evolved side by side with, and informed by the results of, interaction analysis.  It highlights features of the delivery of talk (overlap, delay, emphasis, volume and so on) that have been found to be live in interaction.  That is, they are features of talk treated as relevant in one way or another by the parties to the interaction. 
     This is not the only system of transcription available.  Another well known system used in some discourse analytic and ethnographic work was developed by John Du Bois.  However, the Jeffersonian system has become increasingly standard in the research literature (it is a requirement, for instance, in articles published in Research in Language and Social Interaction).  Note that it is not ideal for all kinds of analytic task.
     For example, it does not encode the sort of features of speech delivery that a full phonetic transcription does, so would not be suitable for studies of speech therapy or the sorts of classic sociolinguistic research on accent variation.
     The most authoritative summary of the Jefferson system is now the following (available from Gene Lerner’s website):
     Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In Lerner, G.H. (Ed). Conversation Analysis: Studies from the first generation. Amsterdam/Philadelphia: John Benjamins (pp. 13-31).
LAF Linguistic Annotation Format/Framework => standards
LAMS, the Learning Activity Management System, is an open source Learning Design system for designing, managing and delivering online collaborative learning activities. It provides teachers with an intuitive visual authoring environment for creating sequences of learning activities.
LIFT (Lexicon Interchange FormaT) => standards.LIFT
linguistic map / Sprachatlas
     A linguistic atlas (Sprachatlas) is a collection of geographic maps of a region on which the distribution of realisations of linguistic units such as sounds, words, designations, word forms or syntactic constructions is recorded. It thus offers an overview of the distribution of dialects and of differentiation within a dialect. 
     For the German-speaking area there is a multitude of atlases (see the overview in Wikipedia), but only a few are accessible via the Internet so far: 
     Digitaler Luxemburgischer Sprachatlas 
     Digitaler Wenker-Atlas 
     Hessen-Nassauisches Wörterbuch 
     Sprechender Sprachatlas der Bayern
Linked Data - In summary, Linked Data is simply about using the Web to create typed links between data from different sources.
     The goal of Linked Data is to enable people to share structured data on the Web as easily as they can share documents today.
     The term Linked Data was coined by Tim Berners-Lee in his Linked Data Web architecture note. 
     The term refers to a style of publishing and interlinking structured data on the Web. 
     The basic assumption behind Linked Data is that the value and usefulness of data increases the more it is interlinked with other data.
     "How to Publish Linked Data on the Web":
     Open Archives Demo showing how an OAI-PMH endpoint is exposed as Linked Data with OAI2LOD server:
LMS Learning Management System, e-learning
     see also VLE, C(ourse)MS
LOC Library of Congress
lossy formats 
     audio: MP3, ATRAC, AAC, Ogg Vorbis, Musepack, WMA
     image: jpeg (note: gif and png use lossless compression, though gif's 256-colour palette can discard colour information on conversion)
LPCM Linear PCM, see PCM
LR Language Resource-s
     "Collections of data which primarily document communicative acts of humans by some form of recording and/or descriptions, both directly as in corpora, or at higher levels of abstraction in lexicons and ontologies. The primary data can be text, video recording and/or audio tracks." - 
LRT Language Resources and Technology
     see D-Spin and "WebLicht: Web based LRT services for German" -
metadata - data about data
     Nathan & Austin 2008.
     "...the golden rule of metadata: Store the metadata in your images ...Locally I can embed the metadata as IPTC or XMP in the JPEGs using a variety of tools... The easiest option to add metadata to your images is to use the Windows Photo Gallery built into Windows Vista. This stores metadata in an XMP section in the image file."
     The following are from
     Inline Image Metadata Standards: "The next question is how is data stored in an image. File formats like JPEG, RAW and TIFF, commonly used in digital photography / processing, allow additional metadata to be stored in the image file."
     "there are three significant inline metadata formats here—Exif, IPTC and XMP. Exif is used by most digital cameras to attach information like exposure at the point you take a picture and tools are available for adding information manually (or automatically) such as geocoding. IPTC “extends” Exif, providing a whole load more named “tags” for metadata. Both are fragile from a technical point of view (easy to destroy / corrupt) and problematic from the point of view of extending (if there’s no named “tag” for the metadata you want to store, you’re out of luck). Meanwhile, XMP, the work of Adobe, is the “holy grail”, providing XML / RDF / Dublin core goodness for storing metadata.
     * The PHP JPEG Metadata Toolkit:
     * Embedding metadata with Flash - Embedding GIF, PNG, JPEG, or MP3 files
metadata mapping / crosswalk provides a mapping of metadata elements from one metadata standard to another. 
     The prerequisite to a meaningful mapping requires a clear and precise definition of the elements in each standard.
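     In the simplest machine-readable form a crosswalk is a mapping table from source elements to target elements, applied record by record. A hedged sketch (the source element names below are invented; real crosswalks such as MARC to Dublin Core rarely align one-to-one):

```python
# Minimal metadata crosswalk: map records from a hypothetical,
# simplified element set onto Dublin Core element names.

CROSSWALK = {              # source element -> Dublin Core element
    "main_title": "dc:title",
    "author":     "dc:creator",
    "year":       "dc:date",
    "topic":      "dc:subject",
}

def crosswalk(record, mapping=CROSSWALK):
    out = {}
    for field, value in record.items():
        if field in mapping:             # unmapped fields are dropped,
            out[mapping[field]] = value  # itself a lossy design decision
    return out

rec = {"main_title": "Building a Turkish Treebank",
       "author": "Oflazer, K.", "year": "2003", "shelf_mark": "X-42"}
dc = crosswalk(rec)
# "shelf_mark" has no Dublin Core counterpart here, so it is lost --
# which is why precise element definitions matter before mapping.
```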
Meta-Description a structured set of meta-data
MP2, MP3, MP4 - MPEG (Moving Picture Experts Group) formats: MP2 and MP3 are MPEG-1 Audio Layer 2 and Layer 3; MP4 is the MPEG-4 Part 14 container format
NLP natural language processing
NLR Natural Language Resources
ontology : formal, explicit specification of a shared conceptualisation
     computer technology: "formal representation of a set of concepts within a domain and the relationships between those concepts..."
     w3c: "Ontology is a term borrowed from philosophy that refers to the science of describing the kinds of entities in the world and how they are related. An OWL ontology may include descriptions of classes, properties and their instances."
     "explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)" Noy & McGuinness 2001
     An ontology provides a shared vocabulary, which can be used to model a domain — that is, the type of objects and/or concepts that exist, and their properties and relations.
     The core meaning within computer science is a model for describing the world that consists of a set of types, properties, and relationship types.
     "an ontology is a description of the concepts and relationships that can exist for an agent" -Tom Gruber 1995
     Dublin Core, a simple ontology for documents and publishing.
     Ontologies: An ontology is a specification of a conceptualization. As such, the term refers to a wide variety of structured objects that in general will include concepts, their definitions and relations between them. Ontologies have to be formalised to a degree that formal inferences can be performed on these data.
     "Ontology uses a representation vocabulary that is a collection of facts about domain to represent the knowledge of the domain (Oguejiofor et al., 2004). 
     "Ontology typically consists of a hierarchical arrangement of the classes and subclasses. 
     "In addition, description of the features and attributes of these classes are specified as well as the relevant restrictions. 
     "Also, a knowledge base that is the addition of individuals, can be developed in the ontology."
     W3C (OWL): "In order for ontologies to have the maximum impact, they need to be widely shared. 
                 In order to minimize the intellectual effort involved in developing an ontology they need to be re-used.
                 In the best of all possible worlds they need to be composed. For example, you might adopt a date ontology from one source and a physical location ontology from another and then extend the notion of location to include the time period during which it holds." -Ontology Mapping,
     = domain/information/reference/conceptual model, domain, realm, universe, "schema"...
     see. domain model, application ontology, OWL, RDF, a list
OPAC on-line public access catalogue
OS, OSS Open Source Software
OWL Web Ontology Language (w3c recommendation, and surprisingly not Ontology Web Language) 
     is the standard language that is proposed by the W3C, to represent ontology in the Web. -
     "OWL Full can be viewed as an extension of RDF, while OWL Lite and OWL DL can be viewed as extensions of a restricted view of RDF." -
     The three sublanguages of OWL: OWL Lite supports those users primarily needing a classification hierarchy and simple constraints; OWL DL supports maximum expressiveness while retaining computational completeness and decidability; OWL Full offers maximum expressiveness and the syntactic freedom of RDF, with no computational guarantees
     Classes: User-defined classes which are subclasses of root class owl:Thing
     Properties: A property is a binary relation that specifies class characteristics.
     Instances: Instances are individuals that belong to the classes defined.
     Operations: OWL supports various operations on classes such as union, intersection and complement. It also allows class enumeration, cardinality, and disjointness.
     Protégé (OWL tool. Stanford Uni. Protégé is a free, open source ontology editor and knowledge-base framework. Latest: Protégé 4.0.1 - October 27, 2009) 
     In addition, Pellet 1.5 has been used for consistency checking and classifying of the proposed ontology. 
parallel treebank =>treebank
PCM (pulse code modulation) an uncompressed format used in digital recorders that capture sound 
     is a method of encoding audio information digitally. -wikipedia
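     PCM simply stores the waveform as a sequence of numeric samples. A sketch: synthesizing one second of a 440 Hz sine tone as 16-bit linear PCM and writing it into a standard WAV container with the Python standard library (the file name is arbitrary):

```python
# Generate one second of 16-bit LPCM audio and write it as a WAV file.
import math
import struct
import wave

RATE = 44100           # samples per second (CD-quality rate)
samples = [int(32767 * math.sin(2 * math.pi * 440 * t / RATE))
           for t in range(RATE)]          # one second of a 440 Hz tone

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 2 bytes per sample = 16-bit LPCM
    w.setframerate(RATE)
    w.writeframes(struct.pack("<{}h".format(len(samples)), *samples))
```

     The WAV format described below is essentially a thin container around exactly this kind of LPCM sample stream.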
PDF Portable Document Format (Adobe file format)
phonetic alphabet
     => IPA, SAMPA
PID Persistent and unique Identifiers
     Handle System by CNRI, DOI
pitch ? (phonology) atım; fırlama
PoS Part of Speech
PURL Persistent Uniform Resource Locator (a resolution service originally developed by OCLC) 
R&D Research and Development
RDF Resource Description Framework - - RDF Schema 1.0, 2004:
     Concepts and info: /
     W3C specifications originally designed as a metadata data model, then a standard for describing web resources. 
        It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, 
        using XML as the language for representing these statements and exchanging them between machines.
        "the sky (subject, a resource) has the color (predicate, an aspect of the subject) blue (object)."
        see also Linked Data,
     RDF statements are represented by "triple"s, represented by such "graph"s (a "node-arc-node" link):
     Subject  --------->  Object
              Predicate (=property, denotes relationship between the things denoted by the nodes)
     Subject node: URI* or blank / Property: URI reference (URIref) / Object: URI, literal** or blank
     *) with optional URIref  **) literals identify values such as numbers and dates by means of a lexical representation 
     SPARQL is a query language for RDF / see also RAP (Rdf API for PHP)
     see also OWL (w3c recommendation, language)
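     The subject-predicate-object model can be sketched without any RDF library: a graph is just a set of triples, and a query is pattern matching with wildcards. A minimal illustration (URIs abbreviated to strings; in real RDF, tools such as rdflib and SPARQL do this job):

```python
# An RDF-style graph as a set of (subject, predicate, object) triples,
# with a tiny pattern matcher where None acts as a wildcard.

graph = {
    ("ex:sky",   "ex:hasColor", "blue"),
    ("ex:grass", "ex:hasColor", "green"),
    ("ex:sky",   "rdf:type",    "ex:NaturalFeature"),
}

def match(graph, s=None, p=None, o=None):
    """Return all triples matching an (s, p, o) pattern."""
    return {t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

colors = match(graph, p="ex:hasColor")              # both colour triples
sky = match(graph, s="ex:sky", p="ex:hasColor")     # what colour is the sky?
```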
REC Research Evaluation Committee
repository ...
     see also archive, digital library
     = entity, element, content, (digital) object, data/information object, stuff, thing
RTF Rich Text Format (file format)
SCORM (Sharable Content Object Reference Model) is a collection of standards and specifications for web-based e-learning. It defines communications between client side content and a host system called the run-time environment (commonly a function of a learning management system). SCORM also defines how content may be packaged
scripting language - is a programming language that is used to manipulate, customise, and automate the facilities of an existing system. In such systems, useful functionality is already available through a user interface, and the scripting language is a mechanism for exposing that functionality to program control. In this way, the existing system is said to provide a host environment of objects and facilities, which completes the capabilities of the scripting language. A scripting language is intended for use by both professional and non-professional programmers.
     see. ECMAScript, JavaScript 
service (for web) a container for a set of system functions that have been exposed to web-based protocols.
SF (Standard Format) - data format of most SIL tools such as Toolbox etc.
SOAP Simple Object Access Protocol
     is a protocol specification for exchanging structured information in the implementation of Web Services in computer networks. It relies on XML as its message format, and usually relies on other Application Layer protocols (most notably Remote Procedure Call (RPC) and HTTP) for message negotiation and transmission.
SPARQL is a query language for RDF
     query remote RDF repositories using W3C SPARQL protocol:
speech recognition / konuşma tanıma
     Aksoylar 2009: "... In total, 34 phonemes were obtained. The resulting phoneme set constitutes a subset of the MetuBET alphabet..."
SVG Scalable Vector Graphics (files)
     SVG is a language for describing two-dimensional graphics and graphical applications in XML.
tag, tagging/etiketleme, tagset - Part-of-speech tagging is the process of marking up the words in a text with their corresponding parts of speech reflecting their syntactic category. 
     "A HMM (Hidden-Markov-Model) POS (part-of-speech) tagger (Brants 2000, e.g.) computes the most probable POS tag sequence for a given word sequence.
     LOCAL set...
     A list of tagsets ( links to:
     German: Stuttgart-Tübingen Tagset (STTS) consists of 54 part of speech tags for the annotation of German text corpora."
     English: Penn-Treebank tagset
     French: French
     ...also Italian, Marco Baroni's Italian tagset, Spanish, Bulgarian, Russian 
     see => project.METU, also Özgül Salor, => comparison of IPA-SAMPA-METUbet form
     - Nuri BAŞOĞLU, Prof. -Yönetim Bilişim Sis. - Tel. 6936 -
     Çakıcı: "Like Czech, Turkish is highly-inflected and has more word order flexibility than languages like English. It is an agglutinating language, so a single word can be a sentence with tense, modality, polarity, and voice. It allows both local and long-distance scrambling. The former means that arguments of verbs may swap order within a clause, and the latter means that an argument may appear in a higher clause than that of the verb which subcategorises for it."
     Levent Altunyurt & Zihni Orhan: Part Of Speech Tagger for Turkish. BS thesis in Computer Engineering, Boğaziçi University, June 2006. -
     Oflazer, Kemal, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gokhan Tür. 2003. Building a Turkish Treebank. In: Abeille Anne (ed.), Treebanks: Building and Using Parsed Corpora. Kluwer, Dordrecht, pages 261–277. PDF
     Ruket Çakıcı and Jason Baldridge. 2006. Projective and non-projective Turkish parsing. In: Proceedings of the 5th Conference on Treebanks and Linguistic Theories. Prague. PDF | PDF
     Haşim Sak & Tunga Güngör & Murat Saraçlar 2007. Morphological Disambiguation of Turkish Text with Perceptron Algorithm, In: Computational Linguistics and Intelligent Text Processing, 8th International Conference, CICLing 2007, Mexico City, Mexico, February 18-24, 2007. Proceedings. Springer Berlin / Heidelberg. (see
          "...When we train the perceptron as a POS tagger, the accuracy is 98.27%. Turkish morphological disambiguation and POS tagging results that we obtained is the best reported so far."
          ! only the abstract and the first page, (c) Springer -
          - Tunga Güngör (BU Dept. of Computer Engineering, Assoc.Prof.Dr., 2007-)
             * Biçimbilimsel Tabanlı Dil Modeli ile Türkçe Konuşma Tanıma Sistemi (Turkish Speech Recognition System with a Morphology-Based Language Model), Tübitak 1001 Research Project 107E261, Tübitak 2007 (continuing). (Fund: 46.490 YTL) (Project Leader)
             "...is famous for printing poems on the back of the exam question sheet. After struggling for the first hour you read Omar Khayyam, say 'exams, whatever', and start thinking about where in Nevizade to head afterwards... he coughs after every drag of his Maltepe cigarette badly enough to make you say 'that's it, he's done for'."
           - Haşim Sak (BU Computer Engineering, PhD student; advisors: Güngör & Saraçlar)
          - Murat Saraçlar (2005- BU Department of Electrical and Electronics Engineering): 
             * Speech and Language Processing; Large Vocabulary Automatic Speech Recognition: Acoustic, Pronunciation and Language Modeling; Statistical Natural Language Processing; Applications of Speech Recognition: Spoken Information Retrieval, Speech Understanding; Statistical and Information Theoretic Methods in Learning.
             Saraçlar: "...is perhaps the warmest and friendliest professor in Boğaziçi electronics."
     Haşim Sak & Tunga Güngör & Murat Saraçlar 2008. Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus. In GoTAL 2008, volume 5221 of LNCS, 2008, pages 417-427. Springer. (springerlink)
          In this paper, we propose a set of language resources for building Turkish language processing applications. Specifically, we present a finite-state implementation of a morphological parser, an averaged perceptron-based morphological disambiguator, and compilation of a web corpus. Turkish is an agglutinative language with a highly productive inflectional and derivational morphology. We present an implementation of a morphological parser based on two-level morphology. This parser is one of the most complete parsers for Turkish and it runs independent of any other external system such as PC-KIMMO in contrast to existing parsers. Due to complex phonology and morphology of Turkish, parsing introduces some ambiguous parses. We developed a morphological disambiguator with accuracy of about 98% using averaged perceptron algorithm. We also present our efforts to build a Turkish web corpus of about 423 million words. 
     Dr. Özgür Yılmazel's (2007- present Assistant Research Professor, School of Information Studies, Syracuse University) 
     Dilek Z. Hakkani-Tür, Kemal Oflazer, Gökhan Tür 2000 ???. Statistical Morphological Disambiguation for Agglutinative Languages, Bilkent?.
     Megyesi, Beata 2001??. Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish.
     Çağrı Çöltekin (to appear 2010). A Freely Available Morphological Analyzer for Turkish. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, May 2010.
     also Schmid articles: 2005, 2008
     LIST/Bibliography on Turkish Information Retrieval and Natural Language Processing:
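     The simplest statistical baseline that HMM taggers such as Brants 2000 improve on is "assign each word its most frequent tag from a tagged training corpus". A sketch with a tiny invented training set (real taggers add context and smoothing for unknown words):

```python
# Most-frequent-tag baseline for POS tagging, trained on a toy corpus.
from collections import Counter, defaultdict

training = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
            ("the", "DT"), ("cat", "NN"), ("runs", "VBZ"),
            ("runs", "NNS")]   # "runs" is ambiguous: verb vs. plural noun

counts = defaultdict(Counter)
for word, t in training:
    counts[word][t] += 1

def tag(word, unknown="NN"):
    # back off to a default tag for unseen words (a common crude choice)
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return unknown

tags = [tag(w) for w in ["the", "dog", "runs", "quickly"]]
# "runs" resolves to VBZ (2 of 3 training occurrences); "quickly" is
# unseen and falls back to the default tag.
```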
TCF TextCorpus format (see WebLicht) 
     TCF strives to be compatible with established standards, especially the data formats of ISO/TC 37/SC 4:
     LAF: Linguistic Annotation Framework/Format, LMF: Lexical Markup Framework, MAF: Morpho-Syntactic Annotation Framework 
     At the moment, converters are available for PAULA, Negra, TüBa-D/Z
TiddlyWiki by Jeremy Ruston -
     is a single html file (!portable, full javascript) which has all the characteristics of a wiki - including all of the content, the functionality (including editing, saving, tagging and searching) and the style sheet.
transcription / çeviriyazı
     "Transcription as a mapping from sound to script must be distinguished from transliteration, which creates a mapping from one script to another" -wikipedia
     The use of different systems for transcription is not merely accidental; transcription systems are designed to support particular kinds of analysis.
     transcription systems/conventions such as HIAT, DIDA, =>GAT (Gesprächsanalytisches Transkriptionssystem, now GAT 2), CHAT 
     see also =>standards.TT, TTML (Timed Text Markup Language) =>TEI =>Jefferson-style transcription
     also related: Tasx (Time aligned signal data exchange format)
     * Thomas Schmidt 2007. Transkriptionskonventionen für die computergestützte gesprächsanalytische Transkription.  
transliteration /...
treebank (or parsed corpus) is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure.
     syntactic tree, skeletal parses showing rough syntactic and semantic information
     ... two main groups can be distinguished: treebanks that annotate 
     - phrase structure (for example the Penn Treebank or ICE-GB), root is S or phrase...
     - "dependency" (root is V etc.) vs. "constituent" syntactic description/structure : (Megyesi 2010) "For the syntactic description, we chose dependency rather than constituent structures, as the former has been shown to be well suited for both morphologically rich and free word order languages such as Turkish, and for morphologically simpler languages, like English and Swedish."
     ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi =>project.METU-Sabanci Turkish Treebank
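     What a dependency-style treebank stores can be sketched as a per-token record of head index and relation label, in the spirit of CoNLL-style formats (the toy sentence and labels below are illustrative, not from any real treebank):

```python
# A dependency-annotated sentence as a list of token records:
# (index, form, head, deprel), where head 0 is the artificial root.

sentence = [
    (1, "the",   2, "det"),
    (2, "dog",   3, "subj"),
    (3, "barks", 0, "root"),
]

def children(tree, head_index):
    """All tokens whose head is the given token index."""
    return [tok for tok in tree if tok[2] == head_index]

root = next(tok for tok in sentence if tok[3] == "root")
deps_of_verb = children(sentence, root[0])   # tokens headed by "barks"
```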
treebank, parallel- 
     Megyesi 2010: "A parallel treebank is a parallel corpus where the sentences in each language are syntactically analyzed, and the sentences and words are aligned. In the treebanks, the syntactic annotation usually follows a syntactic theory, often based on constituent and/or dependency structure (Abeille, 2003)..."
     ...belong to a fairly new type of language resource, consequently we find a smaller amount of resources of this type available..."
     * Megyesi, Beata & Dahlqvist, Bengt & Pettersson, Eva & Nivre, Joakim 2008. Swedish-Turkish Parallel Treebank. 
     * Megyesi, Beata & Dahlqvist, Bengt & Csato, Eva A. & Nivre, Joakim 2010. The English-Swedish-Turkish Parallel Treebank. In: Proceedings of Language Resources and Evaluation (LREC 2010). 2010.: 
     "We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish." using the Uplug toolkit [] for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment.
     ...The annotation is carried on various layers from morphological and part of speech analysis to dependency structures. 
     The tools used for linguistic annotation, e.g. HunPos tagger and MaltParser, are freely available data-driven resources, trained on existing corpora and treebanks for each language.
TTS Text to Speech = phonemisation, speech synthesis, speech production
TXT Text (file format)
UML Unified Modeling Language
URI Uniform Resource Identifier - is a string of characters used to identify or name a resource on the Internet.
     generic syntax is: scheme:[//authority]path[?query][#fragment]
     schemes such as http, ftp, mailto, urn, tel, rtsp, file
     URN/URL relation: "you can find urn:ietf:rfc:3187 (URN) over at (URL)."
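     The components of the generic syntax can be inspected with the Python standard library, which also shows that a URN is just a URI with the urn scheme and no network location:

```python
# Splitting URIs into their generic-syntax components.
from urllib.parse import urlparse

u = urlparse("http://www.example.org/wiki/URI?lang=en#syntax")
# u.scheme == "http", u.netloc == "www.example.org",
# u.path == "/wiki/URI", u.query == "lang=en", u.fragment == "syntax"

n = urlparse("urn:ietf:rfc:3187")
# n.scheme == "urn", n.netloc == "", n.path == "ietf:rfc:3187"
```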
URL Uniform Resource Locator - is a subset of the URI that specifies where an identified resource is available and the mechanism for retrieving it.
     often incorrectly used as a synonym for URI.
URN Uniform Resource Name - is a URI that uses the urn scheme (like ietf, isbn, isan, etc.)
XML eXtensible Markup Language
     The universal format for structured documents and data on the Web.
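     A minimal illustration of structured data in XML, parsed with the Python standard library (the element and attribute names are invented, echoing this glossary's bilingual entries):

```python
# Parse a small XML fragment and read element text and attributes.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<entry term="corpus"><gloss lang="tr">derlem</gloss></entry>')
gloss = doc.find("gloss")
# doc.get("term") == "corpus"; gloss.get("lang") == "tr";
# gloss.text == "derlem"
```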
XMP - Adobe's Extensible Metadata Platform (XMP) is a labeling technology that allows you to embed data about a file, known as metadata, into the file itself.
XSD XML Schema Definition
vocabulary (metadata vocabulary) 'words’ or 'values' to enter into categories of metadata schema
     also => controlled vocabularies
VLE Virtual Learning Environment
     see also LMS, C(ourse)MS
WAV (or WAVE), short for Waveform audio format, also known as Audio for Windows, is a Microsoft and IBM audio file format standard for storing an audio bitstream.
     Both WAVs and AIFFs are compatible with Windows, Macintosh, and Linux operating systems.
web service / webservice is a software system designed to support interoperable machine-to-machine interaction over a network. -W3C (see grid service)
WSDL Web Services Description Language
     It is a platform-, programming language- and protocol-independent description language for web services. WSDL is based on XML.
     "A client program connecting to a web service can read the WSDL to determine what operations are available on the server."
Digital Recordkeeping - Guidelines for Creating, Managing and Preserving Digital Records. 
National Archives of Australia. (c)Commonwealth of Australia 2004

Archives – Records that are appraised as having archival value. Note: This definition of the term differs from the IT sphere, where it refers to ‘a copy of one or more files or a copy of a database that is saved for future reference or for recovery purposes in case the original data is damaged or lost.’
Archival data format – A format into which digital data objects are converted for long-term preservation.
Archival value records – Records appraised as satisfying the Archives criteria for indefinite preservation. These records are also referred to as ‘retain as national archives’ (RNA).
Business information system – Organised collection of hardware, software, supplies, policies, procedures and people, which stores, processes and provides access to an organisation’s business information.
Capture – The process of lodging a document into a recordkeeping system and assigning metadata to describe the record and place it in context, thus allowing the appropriate management of the record over time.
Checksum – An algorithm-based method of determining the integrity and authenticity of a digital data object. Used to check whether errors or alterations have occurred during the transmission or storage of a data object.
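     The checksum idea in a few lines: hash the object's bytes and store the digest; any later change to the bytes changes the digest, revealing corruption or alteration. A sketch using a standard cryptographic hash (SHA-256; the record content is invented):

```python
# Compute and verify an integrity checksum for a digital object.
import hashlib

original = b"Digital record, version 1"
digest = hashlib.sha256(original).hexdigest()   # stored with the object

# Later: recompute and compare to detect alteration.
assert hashlib.sha256(b"Digital record, version 1").hexdigest() == digest
tampered = b"Digital record, version 2"
assert hashlib.sha256(tampered).hexdigest() != digest
```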
Content – That which conveys information, eg text, data, symbols, numerals, images, sound and vision.
Context – The background information that enhances understanding of technical and business environments to which the records relate, eg metadata, application software, logical business models, and the provenance (ie address, title, link to function or activity, agency, program or section).
Conversion – Process of changing records 1. from one medium to another or 2. from one format to another. Conversion involves a change of the format of the record but ensures that the record retains the identical primary information (content).
Data – Facts or instructions represented in a formalised manner, suitable for transmission, interpretation or processing manually or automatically.
Digital record – A record created and/or maintained by means of digital computer technology. Digital records are a subset of electronic records.
Digital signature – A security mechanism included within a digital object that enables the identification of the creator of the digital object, and that can also be used to detect and track any changes that have been made to the digital object.
Document – Recorded information or object which can be treated as a unit.15
Electronic document management system (EDMS) – An automated system used to support the creation, use and maintenance of electronically created documents for the purposes of improving an organisation’s workflow. These systems do not necessarily incorporate recordkeeping functionality and the documents may be of informational rather than evidential value (ie the documents may not be records).
Electronic messages – Any communication using an electronic system for the conduct of official business internally, between Australian Government agencies, or with the outside world.
Electronic messaging systems – Applications used by agencies or individuals for sending and receiving, as well as storing and retrieving, electronic messages. These systems generally do not possess recordkeeping functionality.
Electronic record – A record created and/or maintained by means of electronic equipment. Includes analogue formats (see also Digital record).
Electronic records management system (ERMS) – An automated system used to manage the creation, use, maintenance and disposal of electronically created records
for the purposes of providing evidence of business activities. These systems maintain appropriate contextual information (metadata) and links between records to support their value as evidence.
Emulation – A digital record preservation approach which involves keeping digital records in their original format and recreating the operating environment to enable the original performance of the software to be recreated on current computers. The result is that the original data format is preserved and may be accessed in an environment that allows for the recreation of the original ‘look and feel’ of the record.
Encapsulation – The process of ‘packaging’ records with enough metadata to preserve their content and context, and to support their reconstruction at some time in the future.
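For illustration only (the field names below are invented, not taken from any recordkeeping standard), encapsulation amounts to storing the record's content together with enough contextual metadata in one self-describing package:

```python
import base64
import json

content = b"Dear Sir, ..."   # the record's content (any byte stream)

# Package content and contextual metadata into one self-describing object.
package = json.dumps({
    "content":  base64.b64encode(content).decode("ascii"),
    "metadata": {
        "title":   "Outgoing letter",
        "agency":  "Example Agency",   # provenance (hypothetical values)
        "created": "2009-02-04",
        "format":  "text/plain",
    },
})

# At some future time the record and its context are reconstructed together.
restored = json.loads(package)
assert base64.b64decode(restored["content"]) == content
```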
Encryption – Encryption is the process of converting data into a secure code, through the use of an encryption algorithm, for transmission over a public network. The mathematical key to the encryption algorithm is encoded and transmitted with the data, thus providing the means by which the data can be decrypted at the receiving end, and the original data restored.16
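A toy sketch of the idea only: a repeating-key XOR, which is emphatically not a secure algorithm (real systems use vetted ciphers such as AES), but it shows the convert-then-restore round trip the definition describes:

```python
from itertools import cycle

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Toy symmetric 'encryption': XOR with a repeating key.
    The same function both encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"secret"
plaintext = b"transfer record over a public network"
ciphertext = xor_crypt(plaintext, key)

print(ciphertext != plaintext)                  # True: data is obscured
print(xor_crypt(ciphertext, key) == plaintext)  # True: original restored
```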
Exempt information – Sensitive information defined in section 33 of the Archives Act (eg personal or security related information) which may be withheld from public access beyond 30 years.
Expunge – To delete exempt information from a copy of a record in order to make the remainder of the record available for public access.
Hybrid recordkeeping system – A system containing a combination of paper, electronic or other formats.
Metadata – Structured information that describes and/or allows users to find, manage, control, understand or preserve other information over time.17
Migration – The act of moving records from one system to another, while maintaining the records’ authenticity, integrity, reliability and usability. Migration involves a set of organised tasks designed to periodically transfer digital material from one hardware or software configuration to another, or from one generation of technology to another. 18
Normalisation – Process of changing records from one data format to an archival data format.
Record – Information in any format created, received and maintained as evidence and information by an organisation or person, in pursuance of legal obligations or in the transaction of business.19
Recordkeeping – Making and maintaining complete, accurate and reliable evidence of business transactions in the form of recorded information. Recordkeeping includes:
·  creating records in the course of business activity and the means to ensure the creation of adequate records;
·  the design, establishment and operation of recordkeeping systems; and
·  managing records used in business (traditionally regarded as the domain of records management) and as archives (traditionally regarded as the domain of archives administration).20
Recordkeeping system – Framework to capture, maintain and provide access to evidence over time, as required by the jurisdiction in which it is implemented and in accordance with common business practices. Recordkeeping systems include:
1. both records practitioners and records users;
2. a set of authorised policies, assigned responsibilities, delegations of authority, procedures and practices; policy statements, procedures manuals, user guidelines and other documents which are used to authorise and promulgate the policies, procedures and practices;
3. the records themselves;
4. specialised information and records systems used to control the records; and
5. software, hardware and other equipment, and stationery.21
Records management – Field of management responsible for the efficient and systematic control of the creation, receipt, maintenance, use and disposal of records, including processes for capturing and maintaining evidence of and information about business activities and transactions in the form of records.22
Structure – The appearance and arrangement of a record’s content (eg the relationships between fields, entities, language, style, fonts, page and paragraph breaks, links and other editorial devices).
Vital records – Records without which an organisation could not continue to operate – ie those containing information needed to re-establish the organisation in the event of a disaster. Vital records are those that protect the assets and interests of the organisation as well as those of its clients and shareholders.23
Web-based records – A generic term that refers to all types of web-based information that meets the criteria of a record, including public websites, virtual private networks, extranets and intranets.

9 J. Ellis (ed.), Keeping Archives, 2nd edition, Australian Society of Archivists, Thorpe, Melbourne,
1993, p. 463.
10 IBM Dictionary of Computing, McGraw Hill, New York, 1994, p. 30.
11 Adapted from Standards Australia, Records Management Standard, AS 4390 – 1996, Part 1, 4.17.
12 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 1,
Clause 3.7 and Part 2, Clause
13 International Council on Archives, Dictionary of Archival Terminology, KG Saur, Munich, 1988,
p. 48.
14 Australian Government Information Management Office, Trusting the Internet – A Small
Business Guide to E-security, July 2002, p. 43
15 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 1,
Clause 3.10.
16 Australian Government Information Management Office, Trusting the Internet – A Small
Business Guide to E-Security, July 2002, p. 43
17 Adrian Cunningham, ‘Six degrees of separation: Australian metadata initiatives and their
relationships with international standards’, Archival Science, vol. 1, no. 3, 2001, p. 274.
18 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 1,
Clause 3.13 and Part 2, Clause
19 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 1,
Clause 3.15.
20 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 1,
Clause 4.19 and Part 3, Foreword.
21 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 3,
Clause 6.21.
22 Standards Australia, Australian Standard for Records Management, AS ISO 15489 – 2002, Part 1,
Clause 3.16.
23 J Kennedy and C Schauder, Records Management: A Guide for Students and Practitioners of
Records and Information Management with Exercises and Case Studies, Longman Cheshire,
Melbourne, 1994, p. 302.

N o t e s [top]

RESOURCE Types - from CLARIN (Common Language Resources and Technology Infrastructure - EC FP7 project no. 212230 - A European Research Infrastructure)
"...In this document [CLARIN 2009, p. 33] we distinguish the following major resource types: 
• Text Resources: These are all types of resources that are composed of linear texts such as books, papers, articles etc. 
  Q: is a transcript of an audio file a text resource? 
• Speech Resources: These are audio recordings that can contain speech, singing and other events that can be seen as language material. 
• Multimedia/Multimodal Resources: These are moving picture recordings mostly with integrated sound channel. 
• Time Series Resources: These are other types of linear recordings over time created by some device such as data gloves, EEG, FMRI, articulator measurements, eye trackers, motion trackers etc. 
The objects of these recordings are persons who perform some linguistically relevant action (e.g. articulating a word or sentence) or perform an action on a linguistically relevant object (e.g. reading a text)  
• Images: These are data structures encoding for example photos, drawings, video extractions, scanned texts etc. 
• Annotations: These are structured linguistic encodings that refer to events in time (e.g. Audio or video resources) or to sequences of characters. 
• Lexica: These are complex data structures where lexical units of different types are described with the help of structured attributes. Lexica can include multimedia elements.  
• Concept Registries/Terminologies:  These are flat lists of terms with their related concepts or normative definitions. 
• Ontologies: An ontology is a specification of a conceptualization. As such, the term refers to a wide variety of structured objects that in general will include concepts, their definitions and relations 
between them. Ontologies have to be formalised to a degree that formal inferences can be performed on these data. 
• Objects: An object is an entity in the real world from which linguistic data are derived. Such an object can be a book or a copy of a newspaper on which a text resource is based. 
• Situations: A situation is a real-world event from which linguistic data are derived. Such a situation can be an interview which is recorded (by audio and / or video devices) or an experiment.
• Tools/Services: This is a cover term to describe all sorts of tools, performing NLP speech etc. tasks, web applications and web services that are registered. Time will show whether further sub-classifications will be necessary. 
Some additions to this list which we assume to be necessary: 
• Typological databases: Typological databases contain sets of samples which illustrate one or more linguistic phenomena across a wide range of typologically distinct languages. 
• Grammars: these are more or less formalised accounts of the structure of natural languages 
• Rule sets: Rules describe a set of well-formed constructions of a natural or formal language, or recurring patterns in linguistic data. 
• Concordances: Concordances are extracts mainly from textual resources which present a keyword with a user-defined proportion of its immediate context. 
• Wordlists: Wordlists are derived from corpora and present the word-form types together with some (often quantitative) frequency or class information (e.g. gazetteers) 
• Transcripts: transcripts are written records of audio recordings or the audio part of video recordings 
• Training data sets: Training sets typically include small sets of representative data together with their 
classification. Training sets are needed for learning tasks and competitions, e.g. CLEF. 
Beyond these classes of basic language resources, we have to deal with aggregate resources. In principle, a 
large number of aggregate resources can be formed from the basic resources mentioned above. However, 
there are some types of aggregate resources which are well-known in the community and which therefore 
should be searchable by their own class names: 
• Text Corpus: these are carefully compiled collections of textual data. 
• Speech Corpus: these are carefully compiled collections of audio data in combination with symbolic information (e.g. annotations) and metadata (like speaker descriptions). 
• Multimodal Corpus: these are carefully compiled collections of data which are produced and / or received through different channels or modes  
• Treebank: these are aggregates of textual (or audio) resources and some levels of (linguistic) 
annotation. Treebanks are used for linguistic and lexical studies. 
• Session/Bundle: This resource type is central to the IMDI metadata, where it is defined as follows: The session concept bundles all information about the circumstances and conditions of the linguistic 
event, groups the resources belonging to this linguistic event, records the administrative information of the event and describes the content of the event. 
Session => Bundle or Resource bundle:
"Since the extension of the IMDI schema to include written resources, the original name “Session” for a bundle of resources is under scrutiny. 
Suggestions to use the name “Bundle” or “Resource bundle” are being considered." -IMDI
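Two of the derived resource types mentioned above, wordlists and concordances, can be sketched in a few lines of Python (a toy illustration over a whitespace-tokenized text; the sample sentence is invented):

```python
from collections import Counter

text = "the map of the empire covered the space of a city"
tokens = text.split()

# Wordlist: word-form types with (quantitative) frequency information.
wordlist = Counter(tokens)
print(wordlist.most_common(2))   # [('the', 3), ('of', 2)]

# Concordance (KWIC): each keyword hit with a window of immediate context.
def concordance(tokens, keyword, window=2):
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{tok}] {right}")
    return hits

for line in concordance(tokens, "of"):
    print(line)   # e.g. "the map [of] the empire"
```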


input        ===>    Process / Activity    ===>  output
Resource-s           ------------------          Resource-s

* shallow parsing : chunking : "light parsing" 
  a method for parsing natural language sentences into partial syntactic structures
  is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
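A minimal sketch of the idea: a toy chunker over already POS-tagged tokens (the tagset and the single rule are invented for illustration, not taken from any real shallow parser):

```python
# Group maximal runs of determiners/adjectives/nouns into noun groups (NG);
# leave everything else unchunked -- no internal structure, no sentence role.
NG_TAGS = {"DET", "ADJ", "NOUN"}

def chunk(tagged):
    chunks, group = [], []
    for word, tag in tagged:
        if tag in NG_TAGS:
            group.append(word)
        else:
            if group:
                chunks.append(("NG", group))
                group = []
            chunks.append((tag, [word]))
    if group:
        chunks.append(("NG", group))
    return chunks

sent = [("the", "DET"), ("old", "ADJ"), ("map", "NOUN"),
        ("covered", "VERB"), ("the", "DET"), ("city", "NOUN")]
print(chunk(sent))
# [('NG', ['the', 'old', 'map']), ('VERB', ['covered']), ('NG', ['the', 'city'])]
```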

* parsing : (more formally) syntactic analysis 
  is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given (more or less) formal grammar.
  IN: source string > lexical analysis > OUT: tokenized text
  IN: > syntactic analysis > OUT: parse tree 
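The two-stage pipeline above (source string > tokens > parse tree) can be illustrated with a tiny formal grammar, here arithmetic expressions (a minimal sketch of lexing plus recursive-descent parsing, not an NLP parser):

```python
import re

# Lexical analysis: source string -> token sequence.
def tokenize(src):
    return re.findall(r"\d+|[+*()]", src)

# Syntactic analysis (recursive descent) for the grammar:
#   Expr -> Term ('+' Term)* ; Term -> Factor ('*' Factor)* ;
#   Factor -> NUMBER | '(' Expr ')'
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok
    def factor():
        if peek() == "(":
            eat()              # '('
            node = expr()
            eat()              # ')'
            return node
        return ("num", eat())
    def term():
        node = factor()
        while peek() == "*":
            eat()
            node = ("*", node, factor())
        return node
    def expr():
        node = term()
        while peek() == "+":
            eat()
            node = ("+", node, term())
        return node
    return expr()

print(parse(tokenize("2+3*4")))
# ('+', ('num', '2'), ('*', ('num', '3'), ('num', '4')))
```

The nested-tuple output is the parse tree; note how the grammar makes '*' bind tighter than '+'.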

text (and audio) > annotation ? > treebank (or parsed corpus)
treebanks > TreeAligning > multi

Plain Text + Tokens + PoS + ParseTrees + Semantic
* Text tokenization * Sentence splitting * Morphological analysis * Suffix treatment, retokenization of clitic pronouns
    * Flexible multiword recognition * Contraction splitting * Probabilistic prediction of unknown word categories
    * Named entity detection * Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.)
    * PoS tagging * Chart-based shallow parsing * Named entity classification * WordNet based sense annotation and disambiguation
    * Rule-based dependency parsing * Nominal coreference resolution

OAI conforming repositories - R e g i s t e r e d   D a t a   P r o v i d e r s
5-6 Mitos, 1 eprints (ted), 1 private (Gazi)
Vivaldi - transcription images:
     Master image - TIFF         - long-term storage or print - do not alter, resize, or compress 
     Access image - TIFF or JPEG - screen display or print    - taken from the master; altered for presentation over the Web or other uses 
     Thumbnail    - JPEG or GIF  - screen display             - taken from the access image; reduced in size but not otherwise altered 
xxx made up of over 120 projects which are also partnerships 
made up of over 400 holders of archives, including:
• Regional
• Theme based
• Community 
13th International Congress on Methods in Dialectology (Methods XIII) Leeds, UK. August 4, 2008
4th International Congress of Dialectologists and Geolinguists, Riga, Latvia. August 1, 2003.
target areas: 
* Lifelong Learning of the professional group: encouraging innovation in methodology and teaching from a European perspective, and supporting the national discourse about the aims, role and opportunities for school history and social studies in a modern global society
* Intercultural Dialogue: enhancing professional cooperation, communication and networking within Turkey and beyond, in order to help the professional group become more connected to each other and to Europe.
* katılımcı yönetim - ? participatory governance
...developing xxxxx network and developing independent professional organisations...
Expected Outcomes are:
    * A continuously widening trained group of professional history educators who encourage innovation and disseminate the new ideas among their colleagues, students and pupils.
    * A widened independent and sustainable professional network of innovative history educators and historians in Turkey
    * A publication for history and citizenship education based on innovative methodologies ready to be used in the social science and history classes.
    * Improved contacts with colleagues in the region and throughout Europe
    * An educational website disseminating the innovative methodology and sustaining the network in history education throughout Turkey
Planned activities (workshops, seminars and international training events):
    * National capacity building seminars on innovative methodology and European dimension
    * Authors and experts workshops
    * Development of a publication with ready to use educational material
    * Piloting of project material in Turkish classrooms
    * Training seminars throughout Turkey
    * Participation in international training events
    * Development and maintenance of a project website
    * External project review
  = From Sabancı => Oral history is an interdisciplinary field of study and research method that, especially since the 1960s, has been shaped by the impulse to write history through the memory-based narratives of living individuals in addition to written documents, and to bring ordinary people, everyday life and subjectivity into the scope of historical research; it has also been supported by the development of sound-recording technologies. The primary aim of oral history is to create an archive by recording individuals' life-story narratives in audio and/or video. Depending on the oral historian's interests, this archive provides the core material for products on particular historical periods and topics. A selected individual, family, community, neighbourhood or region can also be the subject of oral history. Until recently, oral history archives consisted mostly of audio recordings, and these recordings were used to produce written texts. Today, new technologies make it possible to create multimedia products, while, in the context of debates on postmodernism, topics such as identity, narrative, memory and subjectivity have become leading areas of social-science research; together, these two developments gave oral history momentum in the 2000s. For oral history, Turkey is an under-studied and potentially very fertile field. Turkish society's interest in the past is only now developing, and individuals are turning to the past in an effort to understand the present. The growing interest in individual testimony, both in popular culture and in academia, is linked to Turkey's painful democratization process and the need it brings to confront our multicultural past. In the post-1980 period, in which individualization came to the fore, the question of identity and participatory democracy, alongside economic crisis, have been among the most important issues on the agenda of Turkey's citizens. As a method that focuses on the subjective perspectives of ordinary people, oral history can make important contributions to debates on Turkey's cultural identity and democratization process. 
In this context, oral history can play an important role both in academia and in civil society. For example, an individual can learn the oral history method and research his or her own family's past, a group of volunteers can write the history of their village, and a neighbourhood association can study the history of its neighbourhood. In short, oral history is a valuable research method and field of study both scientifically and in terms of the politics of everyday life. 
  guide books (kılavuz kitaplar):
* Oral History Work Begins in Mardin - Wednesday, 11 April 2007
The local project office opened under the "Human Landscape of Mardin" project, run by the History Foundation (Tarih Vakfı) with EU support, has started work.
The local office, which began operating under Project Assistant Sevim Çiçek, held a project information meeting on 6 April and an "Oral History Training Workshop" on 7-9 April.
Twenty-three people from the city of Mardin and its surroundings took part in the workshop, led by Prof. Aynur İlyasoğlu. Project Coordinator Gürel Tüzün and Deputy Project Coordinator İrem Arf, who also attended, gave detailed information about the project.
The workshop participants will conduct the oral history interviews to be carried out in and around Mardin under the project.
What is the "Human Landscape of Mardin" project?
Carried out by the History Foundation with European Union support, the "Human Landscape of Mardin" project aims to record the cultural heritage of the multicultural city of Mardin and its surroundings and to pass it on to future generations.
The "Mardin Virtual Culture Centre" to be created at the end of the project will hold written and visual materials, usable by researchers, educators and students, on all the cultural practices of the people of Mardin: from their food culture to their oral traditions, and from their production practices to their customs concerning birth and death.
The project will ensure that the cultural practices of this culturally rich region of Turkey are recorded before they are lost, and will also serve social peace by helping the various cultural groups get to know one another. (AÖ/NZ)
- an example: (# Minimum amount that can be sought is $2,500, # Maximum amount for a single institution annually is $75,000; Maximum amount for a collaborative project annually is $150,000)
- list of USA funds:
- Applying for an ELDP grant: (2010 Grant Cycle - Categories: * Small Grants * Individual Graduate Scholarships * Individual Postdoctoral Fellowships * Major Documentation Projects )

- E-MELD-Grant-Proposal.html
- Proje_Hazirlama_kilavuzu.pdf
* Sürekli Eğitim / Continuing Education
     NC ECHO Digitization Institute: * project planning * the selection process * legal considerations * digital production and scanning * digital camera capture * creating metadata * standards and guidelines * equipment and technical specifications * web design * digital collection management * project management and evaluation * grant opportunities
     Encoded Archival Description Workshops / Grant Writing Workshop / Hometown History Workshops / Workshops to Go
- Tarih Vakfı
     sozlu_tarih_atolyeleri (oral history workshops) -
     Arşivcilik-Kütüphanecilik Çalışmaları (Archival and Library Studies) -
     kilavuz_kitaplar (guide books) -
About The LINGUIST List
* The aim of the list is to provide a forum where academic linguists can discuss linguistic issues and exchange linguistic information.
* With the aid of the publishing community and the contributions of the subscribers, it now offers over 20 fellowships to graduate students, who serve in return as editors of the list.
- Knock, knock!
- Who's there? 
- xxx
- xxx who?
- xxx ...!
new problems...
not only access, interpreted access...
deep difference in the organizational cultures and conceptual models...

"work" or object/subject...
repository      museum/archive                                   library
has             has                                              has

cultur.herit.   artifact        -----> subject of ----->         book etc.
linguistic m.   (Work)                                           (also Work)
ontology is not the real life (but tries to be) ... one-to-one map(ping)

"Of Exactitude in Science
...In that Empire, the craft of Cartography attained such Perfection that the Map of a Single province covered the space of an entire City, and the Map of the Empire itself an entire Province. In the course of Time, these Extensive maps were found somehow wanting, and so the College of Cartographers evolved a Map of the Empire that was of the same Scale as the Empire and that coincided with it point for point. Less attentive to the Study of Cartography, succeeding Generations came to judge a map of such Magnitude cumbersome, and, not without Irreverence, they abandoned it to the Rigours of sun and Rain. In the western Deserts, tattered Fragments of the Map are still to be found, Sheltering an occasional Beast or beggar; in the whole Nation, no other relic is left of the Discipline of Geography.
-From Travels of Praiseworthy Men (1658) by J. A. Suarez Miranda"

The piece was written by Jorge Luis Borges and Adolfo Bioy Casares. English translation quoted from J. L. Borges, A Universal History of Infamy, Penguin Books, London, 1975.

"...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those Unconscionable Maps no longer satisfied, and the Cartographers Guilds struck a Map of the Empire whose size was that of the Empire, and which coincided point for point with it. The following Generations, who were not so fond of the Study of Cartography as their Forebears had been, saw that that vast Map was Useless, and not without some Pitilessness was it, that they delivered it up to the Inclemencies of Sun and Winters. In the Deserts of the West, still today, there are Tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography."

Jorge Luis Borges, “On Exactitude in Science,” in Collected Fictions (trans. Andrew Hurley; New York: Viking, 1998), 325
"genre"s for the collection items
(from AILLA - a digital archive of recordings and texts in and about the indigenous languages of Latin America.
 based on IMDI ?)

References [top]
^ Aksoylar, Cem and Mutluergil, Süha Orhun and Erdoğan, Hakan 2009. Bir Türkçe konuşma tanıma sisteminin anatomisi (The anatomy of a Turkish speech recognition system). In: IEEE 17th Signal Processing and Communications Applications Conference, 2009 (SIU 2009), Antalya, Türkiye. 512-515.
   Sabancı University, Engineering - Official URL: - 978-1-4244-4436-6/09/$25.00 ©2009 IEEE
Bradley, Kevin (Ed. and IASA President and Vice chair of IASA Technical Committee) 2009. Guidelines on the Production and Preservation of Digital Audio Objects, 2nd Ed. March 2009.
Brants, Thorsten 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA.
CLARIN 2009. Daan Broeder, Bertrand Gaiffe, Maria Gavrilidou, Erhard Hinrichs, Lothar Lemnitzer, Dieter Van Uytvanck, Andreas Witt, Peter Wittenburg 2009. Metadata Infrastructure for Language Resources and Technology. 2009-02-04 - Version 5.
Geoffrey Leech 2004. Developing Linguistic Corpora: a Guide to Good Practice - Chp.2 - Adding Linguistic Annotation - 
Salor 2002. Özgür Salor, Bryan Pellom, Tolga Çiloğlu, Kadri Hacıoğlu, Mübeccel Demirekler 2002. On Developing New Text and Audio Corpora and Speech Recognition Tools for the Turkish Language. In: ICSLP-2002 (7th International Conference on Spoken Language Processing), Denver, USA. p. 349-352.
   Prof. Mübeccel Demirekler:
Helmut Schmid 2005. A Programming Language for Finite State Transducers, Proceedings of the 5th International Workshop on Finite State Methods in Natural Language Processing (FSMNLP 2005), Helsinki, Finland. (ps, pdf) 
Helmut Schmid and Florian Laws 2008. Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging, COLING 2008, Manchester, Great Britain. 
   RFTagger... (pdf-
^ Nathan & Austin 2008. Metadata-thick-and-thin... 
     URL: eldp2008_6/resources