Call for Papers:
Workshop: Uses and Users for Parallel Corpora in
the Translation Process
Association for Machine Translation in the Americas
(AMTA)
November 4-5, 2010, Denver, Colorado
(in conjunction with the American Translators
Association Conference)
The purpose of this workshop is to explore how the translation
community currently uses parallel corpora and how it expects to use them
in the future. A parallel corpus generally refers to a large collection of
translated text. These texts are often aligned at the sentence or phrase
level and annotated with a specific task in mind, which motivates a markup
schema. Bilingual parallel texts are referred to as bitext, whereas
parallel corpora can be multilingual (e.g., the many translations of the
Bible).
Submissions will address and explore the many reasons why
people create corpora, what corpora they would like to see created, how
translators are making use of corpora, how translation systems are utilizing
corpora according to type and structure, and what privacy and copyright
issues accompany the many uses, both by machines and by
people.
Collections of parallel corpora abound, whereas definitions
and structuring of corpora seem to vary across sites[1]. Examples of the
kinds of differences involve source text markup, transliteration, target text
markup, methods of associating source and target, and alignment.
Processing needed for different applications varies widely
according to context and function; for example, how granular do associations
between source and target need to be, how much tagging needs to occur
(morphological, syntactic, semantic), what types of alignment are needed for
which purposes, and how much of the markup is manual or automatic.
Furthermore, given the wide range of preprocessing needs, what is the quality
check process as part of the overall workflow?
When using corpora to aid in human translation,
especially in conjunction with Translation Memory software, which
representation standards are being applied, or should be applied (for example
TMX, TBX, SRX, xml:tm, etc.), and what are some of the compatibility issues
encountered?
Finally, what are the standard existing uses for various
kinds of parallel corpora, and what are some of the nascent needs that can
only be explored once massive amounts of data are collected? Some of these
uses and users may simply need smaller amounts of data, but still require backup
corpora for validation and extension of data. What can translators expect
from parallel corpora? Of what use are these resources for others in the
translation field, whether in government, industry, or academia?
Two issues that the translation community has addressed only gingerly
are privacy and permissions for copyrighted text,
particularly when dealing with limited extraction of, say, technical terms and
their translations that could in no way be used to reconstruct the sources. A
liberal interpretation might claim that such extraction constitutes neither an
invasion of privacy (in corpora that consist of logs, chats, emails, etc.) nor an
infringement of copyright. A more conservative interpretation of privacy or
infringement, on the other hand, might claim that this use does constitute misuse.
Most people either overlook these issues or, taking the more cautious approach,
are blocked from making progress.
The types of questions that this workshop will address
include:
· How do you create parallel corpora as part of your workflow?
· In what ways do you use parallel corpora?
· What techniques do you use to evaluate usefulness (for people or systems)?
· How are your corpora processed (alignment, markup, standards)?
· What kinds of quality ratings do you use?
· What are your lessons learned?
Proposers are encouraged to participate by sharing their
experiences, projects, needs and findings as a single contributor or as a member
of a panel. Of interest is the workflow for creating or finding,
processing, standardizing and using parallel or comparable corpora to improve
language training of humans and machines. Furthermore, if participants
have developed a financial return-on-investment scenario for using parallel
corpora, those insights and justifications are also welcome as presentation
topics.
Organizers:
Judith L. Klavans – U.S. Government and University of Maryland
Elizabeth McGrath – MITRE Corporation
Trent Rockwood – MITRE Corporation
Important Dates and Schedule:
June 10, 2010 – send out call for papers
July 20, 2010 – papers due
August 20, 2010 – send reviews back to submitters
September 10, 2010 – revisions due back to AMTA for printing
November 4-5, 2010 – workshop dates
Format
6 page max, 11pt minimum, 2 column, ACM format:
http://www.acm.org/sigs/publications/proceedings-templates
Submissions and questions to: Trent Rockwood, [log in to unmask]
[1] A few examples of major collections include the Linguistic Data
Consortium (www.ldc.upenn.edu), the British National Corpus (bnc.org), the
JRC-Acquis corpora (http://wt.jrc.it/lt/Acquis/), the ELRA MLCC Multilingual
and Parallel Corpora (http://catalog.elra.info), Japanese-Chinese corpora
(www.nict.go.jp/), and many others.