>The School of Cultural Texts and Records, Jadavpur University
>
>in collaboration with
>
>The Association for Literary and Linguistic Computing, UK
>
>presents
>
>
>
>DIGITAL TEXTS IN EDITABLE FORMAT
>
>with special reference to Indic languages
>
>CONFERENCE, 7-8 FEBRUARY 2007 | WORKSHOP, 9-10 FEBRUARY 2007
>
>Directors: Sukanta Chaudhuri, Subha Chakraborty Dasgupta, Samar Bhattacharya,
>
>Anirban Ray Chaudhuri
>
>
>
>Both the conference and the workshop will cover the full range of issues
>
>relating to digitising of texts and the creation
>of digital archives. The major
>
>objectives will be
>
> to address the technical problems of digitising Indic scripts in editable
>
> format through Optical Character Recognition
>
> to address the textual and archival aspects
> of digitising texts and documents
>
> in Indic scripts: locating, compiling and editing the resources
>
>However, all presentations need not address texts in Indic languages. General
>
>papers on digital text technology are welcome,
>as also reports and analyses of
>
>representative projects in digital archiving in
>any language. While a good many
>
>presentations will be on Bengali texts, material on other Indic languages is
>
>particularly welcome. All presentations must be in English.
>
>
>
>THE CONFERENCE on 7-8 February 2007 will be for a general audience comprising
>
>all persons interested in archival, editorial
>and textual study, or in digital
>
>technology and literary and linguistic computing. It should be of interest to
>
>students of literature, history, linguistics, or
>any other discipline involving
>
>the study of texts and documents.
>
>THE WORKSHOP on 9-10 February 2007 is intended
>for a more specialised group of
>
>participants, with direct experience or strong
>interests in digital technology
>
>(especially font generation and OCR), electronic archiving and electronic
>
>editing. Please inform us separately if you wish
>to take part in the workshop.
>
>
>
>Most papers will be by invitation. For more information please contact:
>
>
>
>Sukanta Chaudhuri at
>[log in to unmask]
>
>or Subha Chakraborty Dasgupta at [log in to unmask]
>
>or enquire by post from
>
>Sukanta Chaudhuri, Director, School of Cultural Texts and Records, Jadavpur
>University, Kolkata 700 032, India
>
>________________________________________________________________________________________________
>
>PROPOSED WORKSHOP ON
>
>DIGITISING INDIC TEXTS IN EDITABLE FORMAT
>
>TO BE JOINTLY ORGANISED BY
>
>THE ASSOCIATION FOR LITERARY AND LINGUISTIC COMPUTING, UK
>
>AND
>
>JADAVPUR UNIVERSITY, KOLKATA, INDIA
>
>
>
>CONFERENCE: 7-8 February 2007 | WORKSHOP: 9-10 February 2007
>
>Event Directors: Sukanta Chaudhuri, Subha Chakraborty Dasgupta, Samar
>
>Bhattacharya, Anirban Ray Chaudhuri
>
>
>
>Aims and Scope:
>
>Both the Conference and the Workshop will
>address the digitization of texts in
>
>editable format with special reference to Indic
>languages, particularly Bengali.
>
>The conference will be for a bigger and less
>specialized group of participants:
>
>scholars of literature, history and all other disciplines involving textual
>
>study and archiving, as well as computer
>scientists and other technologists with
>
>an interest in literary and linguistic computing. The Workshop will be for
>
>fewer, technically specialized participants who have actually worked in this
>
>field. Others may attend as auditors.
>
>
>
>We propose Bengali as the language to focus on, as (a) the language of the
>
>region where Jadavpur University is located; and (b) a language where a good
>
>deal of work has already been done in the above
>respects, ensuring an informed
>
>and interactive milieu. Today, efficient word-processing programmes exist for
>
>all major Indian languages. Some work has been done on OCR programmes in
>
>the Devanagari script (used for Hindi, the country's official language), and in
>
>other Brahmi-derived scripts including Bengali. But
>other advanced functions, essential
>
>for textual study and editorial processing, have
>hardly been pursued anywhere in
>
>India. In Bengali, at least a foundation has
>been laid which can be consolidated
>
>in the Workshop.
>
> Of course we shall invite experts in other
> languages, as most of the issues
>
>are germane to their work as well. The experts from abroad may not have
>
>knowledge of Bengali or other Indic scripts.
>Rather, their contribution will be
>
>valuable precisely by virtue of drawing on a broader field of research and
>
>experience.
>
>
>
>The subject of the workshop and conference is of importance from two angles:
>
> technical: to solve the special problems of digitizing Indic scripts.
>
> scholarly, archival and bibliographical: to create literary archives and
>
> foster an editorial culture.
>
>(a) Technical:
>
>The special problems of many Indic alphabets,
>including Bengali, are as follows:
>
> Only consonants are written in full, with the accompanying vowel sounds
>
> indicated by tagged-on vowel markers.
> Although these vowel sounds phonetically
>
> follow the consonant, they are sometimes
> written before it, and at other times
>
> above or below. This makes font creation and screen visualization more
>
> difficult. It also makes certain functions of
> text analysis – e.g., phonetic
>
> analysis, or collation of texts with variant spellings – specially
>
> problematic, as the visual sequence presented on screen does not match the
>
> phonetic sequence followed during keying-in and hence registered in the
>
> processing unit.
>
> The Bengali alphabet has about 50 letters, without majuscule/minuscule
>
> variation. (The 'about' is significant: there is some debate as to what
>
> constitutes a letter.) There are also a huge
> number of conjunct letters (2, 3
>
> or even 4 conjunct consonants plus a vowel),
> besides a range of vowel tags (in
>
> several forms for each vowel, depending on
> the consonant it is attached to)
>
> and some 'half-letters' (consonants without
> vowels). All this vastly increases
>
> the number of glyphs, to a total of 450-500
> items in the average 'typecase'.
>
> Moreover, the forms of these conjuncts vary from font to font in print.
>
> Despite a trend towards simplification, most
> of these conjunct letters will be
>
> with us for a long time to come. And needless
> to say, they will always have to
>
> be processed in the case of extant texts.
>
> This makes it a great challenge both
>
> to generate these conjuncts in fonts for
> electronic use: though many Bengali
>
> fonts have been generated, they nearly
> always have a measure of glitches or
>
> instability; and
>
> to develop an OCR programme that can 'read'
> these conjuncts in extant print
>
> fonts. Again, much work has been done, but
> with an accuracy of 95%+ only in
>
> certain text situations. It is often no more than 85%.
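The mismatch between written order and stored order described above can be seen directly in encoded text. The following is a minimal sketch, assuming the text is encoded in Unicode (the announcement mentions Unicode fonts but does not describe any particular encoding pipeline). It uses only Python's standard `unicodedata` module; the helper name `code_point_names` is made up for illustration.

```python
import unicodedata

# "ki": the consonant KA followed by VOWEL SIGN I. On screen the vowel sign
# is drawn to the LEFT of the consonant, but it is stored AFTER it.
KI = "\u0995\u09bf"

# A conjunct: KA + VIRAMA (hasanta) + SSA, normally drawn as a single glyph
# although it is stored as three code points.
KSSA = "\u0995\u09cd\u09b7"


def code_point_names(text):
    """Return the Unicode character names of `text` in stored order."""
    return [unicodedata.name(ch) for ch in text]


for label, text in (("ki", KI), ("kssa", KSSA)):
    print(label, text, code_point_names(text))
```

Any search, collation or phonetic analysis that works on the stored sequence sees the consonant first; software that works from the printed page (OCR in particular) sees the visual order and has to map it back.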
>
>The proposed workshop would offer a rare chance
>for specialists from abroad to
>
>learn of these problems and the ways Indian experts are tackling them, and in
>
>turn to suggest new approaches and solutions based on their experience with
>
>Roman or other alphabets.
>
> The issues to be taken up could be:
>
> improving and stabilising Unicode fonts in Bengali and related Indian
>
> languages;
>
> improving and extending the OCR programmes
> developed so far, to create texts
>
> suitable for all kinds of textual and phonetic analysis;
>
> further mark-up of these texts to produce fully editable, collatable and
>
> searchable versions;
>
> improving an experimental collation programme already developed.
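To make the idea of collation concrete, here is a minimal word-level sketch. It is not the School's experimental collation programme, whose design is not described in this announcement; it simply aligns two witnesses with Python's standard `difflib` after Unicode NFC normalisation, so that the same letter encoded in two different ways is not reported as a variant. The function name `collate` and the sample strings are hypothetical.

```python
import difflib
import unicodedata


def collate(witness_a, witness_b):
    """Return (tag, words_a, words_b) for each point where two witnesses differ."""
    # Normalise first so that canonically equivalent encodings compare equal.
    a = unicodedata.normalize("NFC", witness_a).split()
    b = unicodedata.normalize("NFC", witness_b).split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Romanised placeholder text, not taken from any real edition.
print(collate("the quick brown fox", "the quick red fox"))

# The Bengali vowel sign O stored as one code point or as two parts:
# after normalisation the two spellings are identical, so nothing is reported.
print(collate("\u0995\u09cb", "\u0995\u09c7\u09be"))
```

A real collation tool would also need rules for genuine spelling variants, which normalisation alone does not resolve.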
>
>All this would help us work towards the ultimate
>goal of ensuring that texts in
>
>Indic languages can, one day, be processed in
>every way considered standard for
>
>the Roman alphabet: search/concordance,
>collation, OCR, phonetic analysis etc.
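For OCR, accuracy figures such as the 85% and 95% quoted earlier depend on how accuracy is measured, which the announcement does not specify. One common definition, shown here as an assumption rather than the organisers' method, is one minus the character edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Share of the reference text that the OCR output reproduced correctly."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# One wrong character in ten gives 90% accuracy.
print(character_accuracy("abcdefghij", "abcdefghiX"))
```

Because a conjunct occupies several code points, a single misread conjunct can count as more than one error under this measure; a per-glyph measure would give different figures.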
>
>
>
>(b) Scholarly, archival and bibliographical
>
> Bengali has an extensive literature, whose 'modern' phase began in the
>
>early 19th century. It was the first Asian language to come into extensive
>
>contact with Western literature and thought.
> From the 19th century, it produced
>
>a great range of educational, social, religious and even
>
>scientific/technological works, as well as the first body of modern creative
>
>literature in any Indian language. This corpus
>forms the textual basis of what
>
>is often called the Bengal Renaissance, reaching its climax in the work of
>
>Rabindranath Tagore (1861-1941). Despite dramatic changes over the last
>
>half-century, Bengali literature and culture can still be said to live in the
>
>aftermath of the Bengal Renaissance.
>
> Bengali was also (except for a brief, soon closed chapter in western
>
>India) the first Asian language to achieve print. There is a huge body of
>
>printed texts from the late 18th century onwards. Quantity apart, the earlier
>
>material is seminal for Indian – indeed, world – printing history. It shows a
>
>specially interesting amalgam of Western
>techniques developed over 350-400 years
>
>with innovations specific to the local script and conditions of production.
>
> Bengal is famous for its vibrant literary culture, with a rich body of
>
>creative works and their scholarly interpretation. But so far as textual
>
>scholarship and editorial attention are concerned, this creative and critical
>
>activity is taking place in a near-vacuum.
>Hardly a score of Bengali texts are
>
>available in critical editions as international scholarship understands the
>
>term. Original 19th-century works are often hard
>to come by, surviving only in
>
>one or two copies, often badly preserved and
>deteriorating in the hot and humid
>
>climate.
>
> Combining the technical and scholarly
> imperatives, the Workshop would help
>
>us work towards the goal of ensuring that texts in Indic languages can be
>
>processed in every way considered standard for the Roman alphabet –
>
>search/concordance, collation, OCR, phonetic analysis etc. – and hence made
>
>accessible for all kinds of editorial and
>scholarly activity. Thus our two major
>
>needs would be served:
>
> to ensure the sheer physical record of this
> rich body of works in digital
>
> form.
>
> to generate an editorial culture by
> producing electronic texts in editable
>
> format.
>
>
>
>LOCAL PARTICIPANTS
>
> As stated above, the Conference on 7-8 February will attract a more
>
>general audience. The Workshop may have
>intensive participation by a core group
>
>of approx. 15 local members. Another 15-20 persons – a few senior members, but
>
>chiefly young project and research staff – may
>attend to absorb the culture of
>
>electronic texts. These 'auditors' will be
>welcome to take active part, but they
>
>are unlikely to do so often. We hope, nonetheless, that they will feel
>
>encouraged to interact with the experts outside
>the workshop and in the future.
>
>In particular, young staff working on a single limited aspect of electronic
>
>texts will benefit greatly from this broader experience.
>
> Among the established scholars and
> workers in the field who, we hope, will
>
>attend the Workshop are the following. This list is neither confirmed nor
>
>complete.
>
> a) Professor Kalyan Kumar Datta and
> Professor Samar Bhattacharya, School
>
>of Education Technology, Jadavpur University:
>members of the 'Vidyasagar' group
>
>that developed the first Bengali electronic fonts.
>
> b) Professor Mita Nasipuri, Dr Anirban
> Ray Chaudhuri, and other members of
>
>the Department of Computer Science and Engineering, Jadavpur University
>
>associated with CMATER, an OCR development
>centre attached to their Department.
>
> c) Professor Bidyut Baran Chaudhuri, Indian Statistical Institute,
>
>Kolkata, who developed the first viable OCR programme in Bengali.
>
> d) Professor Ashok Mukhopadhyay, sometime Professor of Printing
>
>Engineering, Jadavpur University and CEO of the University Press attached to
>
>Visva-Bharati, the university founded by
>Rabindranath Tagore and till recently
>
>custodian of his works.
>
> e) Professor Gautam Sengupta, Professor of Linguistics, University of
>
>Hyderabad: a noted applied linguist with much work on Bengali fonts and
>
>electronic texts.
>
> f) Professor Palash Baran Pal and
> Professor Somendra Mohan Bhattacharya,
>
>Saha Institute of Nuclear Physics, Kolkata: physicists who have also worked
>
>extensively on Bengali fonts, word-processing programmes and online text
>
>databases.
>
> g) Members of local software groups –
> professional, semi-professional and
>
>amateur – such as the 'Ankur' group, who are working with Bengali electronic
>
>fonts and texts.
>
> h) Members of the School of Cultural Texts and Records, Jadavpur
>
>University: literary and humanistic scholars
>with expertise in electronic texts:
>
>e.g., Professor Subha Chakraborty Dasgupta,
>Professor Amlan Das Gupta, Dr Moinak
>
>Biswas, Dr Amitava Das, Dr Samantak Das, Dr Abhijit Gupta, Dr Rimi B.
>
>Chatterjee.
>
> Among younger delegates and 'auditors', we would specially welcome the
>
>young project staff and ancillary workers attached to the School of Cultural
>
>Texts and Records, Jadavpur University, and various relevant units of the
>
>Faculty of Engineering and Technology.
>
>Note on Jadavpur University:
>
>Jadavpur University began as a technological
>institution. But it is unique among
>
>Indian universities in that, over the last 20-30
>years, it has built up one of
>
>India's most successful Arts Faculties,
>including four departments of language
>
>and literature: Bengali, Comparative Literature, English and Sanskrit. It is
>
>arguably the most appropriate venue in India for literary and linguistic
>
>computing.
>
> It has already made many contributions in
> the field. The first electronic
>
>Bengali fonts were developed here. The people who developed them are still
>
>around (chiefly attached to the School of
>Education Technology), and will take
>
>part in the Workshop. Notable work on Bengali and Devanagari OCR is currently
>
>going on in the CMATER Centre of the Department of Computer Science and
>
>Engineering: they have developed 'Anulikhan',
>the first OCR system in any Indic
>
>script with editable format and original layout retention.
>
> The School of Cultural Texts and Records
> (comprising technologists as well
>
>as literary scholars and historians) conducts a
>range of textual projects using
>
>electronic resources. These include an experimental collation software – the
>
>first in any Indic language. The School already
>has a major digital archive of
>
>Bengali literary manuscripts in non-editable form, various bibliographical
>
>databases including the first Short-Title
>Catalogue in any Indian language, and
>
>a large music archive in digitized form. There
>is interaction between members of
>
>the Arts Faculty and the Faculty of Engineering and Technology in matters of
>
>textual computing. We hope the Workshop will enhance this.
>
>
>
>ALLC PARTICIPANTS: To be provided by ALLC. The
>team will be led by Prof. Laszlo
>
>Hunyadi, Chairman, Department of General and
>Applied Linguistics and Director,
>
>Centre for Digital Humanities, University of Debrecen, Debrecen, Hungary.
>
>
>
>PROGRAMME SCHEDULE (to be finalised after consultation with the ALLC):
>
> The programme of papers for the
> Conference will be finalized later. For
>
>the Workshop, we propose two sessions each day, divided as follows:
>
> 1. Improvement of Unicode font generation in Bengali, for manual keying-in of
>
> texts in editable format.
>
> 2. Degraded document processing and core OCR engine.
>
> 3. Advanced OCR system with enhanced graphical user interface for document
>
> digitization in editable format retaining original layout.
>
> 4. Further development of the collation
> programme for Indic scripts already
>
> available with the School of Cultural Texts and Records.
>
>
>
>
>
>WORKSHOP OUTCOMES AND DISSEMINATION:
>
>As remarked above, despite the rich textual heritage in Bengali, there is
>
>relatively little textual and editorial
>awareness. Today, electronic resources
>
>can enable us to achieve this awareness, and thus leap-frog into an advanced
>
>editorial culture, in a relatively short span of
>time. Once we have a 'bank' of
>
>digital texts in editable format, we can proceed
>to electronic editing and other
>
>advanced processing of texts.
>
> This calls for the close interaction of
> textual and documentary scholars
>
>in Indic languages with experts in electronic
>texts and literary computing. As
>
>yet, there is little such contact. We hope the
>Workshop will help to bridge the
>
>gap. It will bring technological and moral support to the people actually
>
>working on electronic texts in Bengali and other
>Indic languages, and help them
>
>find their place in an international context. At
>the same time, it will foster a
>
>more informed level of operational skill among general textual and literary
>
>scholars. Delegates from both these categories will form the first group of
>
>beneficiaries.
>
> There will be a 'spread effect' extending
> to other Indian languages, where
>
>the problems are often the same. Experts in
>electronic texts in these language
>
>areas would constitute a second tier of beneficiaries.
>
> There would also be a 'spread effect' of another sort: raising textual
>
>awareness, and exercising that awareness through electronic texts, among all
>
>students and archivists of Indic languages, literature and history – in fact,
>
>any discipline requiring textual documentation. This raising of consciousness
>
>would confer an unquantifiable benefit at a third level.
>
>
++++++++++++++++++++++++++++++++++++++++
Simon Tanner
Director, King's Digital Consultancy Services
King's College London
Kay House, 7 Arundel Street, London WC2R 3DX
tel: +44 (0)20 7848 1678 or +44 (0)7887 691716
email: [log in to unmask]
www.digitalconsultancy.net
Connecting Culture and Commerce Conference: January 2007, National Gallery
http://www.digitalconsultancy.net/mcg2007/