In conjunction with FIRE 2016
[Test sets have been emailed to the participants. Please check your spam folder if you have not received the mail.]
 

 

A large number of languages, including Arabic, Russian, and most of the South and South East Asian languages, are written using indigenous scripts. However, the websites and the user-generated content (such as tweets and blogs) in these languages are often written using the Roman script due to various socio-cultural and technological reasons. This process of phonetically representing the words of a language in a non-native script is called transliteration. Transliteration, especially into the Roman script, is used abundantly on the Web, not only for documents but also for user queries that intend to search for these documents. This situation, where both documents and queries can be in more than one script and the user may expect to retrieve documents across scripts, is referred to as Mixed Script Information Retrieval.


History

Two pilot subtasks on transliterated search were introduced as a part of FIRE 2013. Subtask 1 was on language identification of the query words followed by transliteration of the Indian-language words. The subtask was conducted for three Indian languages - Hindi, Bangla and Gujarati. Subtask 2 was on ad hoc retrieval of Bollywood song lyrics - one of the most common forms of transliterated search that commercial search engines have to tackle. Five teams participated in the shared task.

In FIRE 2014, the scope of subtask 1 was extended to cover three more South Indian languages - Tamil, Kannada and Malayalam. In subtask 2, we introduced (a) queries in Devanagari script, and (b) more natural queries with splitting and joining of words. More than 15 teams participated in the tasks.

In FIRE 2015, the shared task was renamed from "Transliterated Search" to "Mixed Script Information Retrieval" to align it with the framework proposed by Gupta et al. (2014). Three subtasks were conducted. Subtask 1 was extended further by including more Indic languages, and transliterated text from all the languages was mixed. Subtask 2 was on searching movie dialogues and reviews along with song lyrics. Mixed script question answering (MSQA) was introduced as Subtask 3.

 

Task Description

Subtask 1: Code-Mixed Cross-Script Question Classification

Being a classic application of natural language processing, question answering (QA) has practical applications in various domains such as education, health care, and personal assistance. QA is a retrieval task that is more challenging than the task of a common search engine, because the purpose of QA is to find an accurate and concise answer to a question rather than just to retrieve relevant documents containing the answer (Li and Roth, 2002). Recently, Banerjee et al. (2015) formally introduced the code-mixed cross-script QA research problem. The first step in understanding a question is to perform question analysis. Question classification is an important part of question analysis that detects the answer type of the question. Question classification helps not only to filter out a wide range of candidate answers but also to determine answer selection strategies (Li and Roth, 2002). Furthermore, it has been observed that the performance of question classification has a significant influence on the overall performance of a QA system.

Let Q = {q1, q2, ..., qn} be a set of factoid questions written in Romanized Bengali mixed with English (i.e., a question may also contain English words and phrases). Let C = {c1, c2, ..., cm} be the set of question classes. The task is to classify each given question into one of the predefined coarse-grained classes.

Language: Code-mixed Bengali-English

Example:
Question: last volvo bus kokhon chare ?
Question Class: TEMPORAL

Data and Resources:
A dataset of questions tagged with question classes will be released as training data for this task. Participants can use any other resources that they have access to.
Each entry in the dataset has the format: q_no q_string q_class
where q_no, q_string, and q_class refer to the question number, the code-mixed cross-script question string, and the class of the question, respectively.
Example: last volvo bus kokhon chare ? TEMPORAL
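
For concreteness, the task can be approached with a simple text classifier. The sketch below is only an illustration and makes assumptions not specified by the task: the training file name (train.txt), whitespace-separated fields as in the format above, and a scikit-learn TF-IDF/logistic-regression pipeline. Participants are free to use any method.

# Minimal question-classification sketch (Python); file name and pipeline are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions, labels = [], []
with open("train.txt", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        parts = line.strip().split()
        if len(parts) < 3:
            continue
        # q_no is parts[0], q_class is the last field, q_string is everything in between
        questions.append(" ".join(parts[1:-1]))
        labels.append(parts[-1])

# Character n-grams cope reasonably well with spelling variation in Romanized, code-mixed text
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(questions, labels)
print(model.predict(["last volvo bus kokhon chare ?"]))   # e.g. ['TEMPORAL']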

 

Subtask 2: Information Retrieval on Code-Mixed Hindi-English Tweets

Social media has become almost ubiquitous in present times. This proliferation creates a need for automatic information processing and poses various challenges. Social media content is mostly informal in nature. Moreover, on Indian social media, users often prefer Roman transliterations of their native languages with English embeddings. Information retrieval (IR) on such Indian social media data is therefore a challenging and difficult task when the documents and the queries are a mixture of two or more languages written in their native scripts and/or in Roman transliterated form. Recently, Chakma and Das (2016) formally introduced the problem of Code-Mixed Information Retrieval (CMIR) for Hindi-English tweets, emphasizing issues related to IR over code-mixed Indian social media text, particularly text from Twitter. CMIR was motivated by the work of Gupta et al. (2014).

Let L = {l1, l2, l3, …, ln} be the set of natural languages and S = {s1, s2, s3, …, sn} be the set of scripts in which these languages are written. Let a word w written in language li using script sj be denoted by ⟨li, sj⟩. When i = j, the word is written in its native script; otherwise, it is in transliterated form. Let q be a query over a set of documents D, where the IR task is to rank the documents in D so that the documents most relevant to q appear at the top. For a bilingual query q, whose terms may be written in both Hindi and English, we can assume that q ∈ ⟨li ∪ lj, si ∪ sj⟩, q ∈ ⟨li ∪ lj, si⟩, or q ∈ ⟨li ∪ lj, sj⟩.
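
For illustration, take li = Hindi, lj = English, and sj = the Roman script; the example query "netaji ke files" below then mixes Romanized Hindi words (netaji, ke) with the English word files, so q ∈ ⟨li ∪ lj, sj⟩.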

Language: Code-mixed Hindi-English

Example:
Queries: a) netaji ki files, b) netaji ke files
Query Description: Information sought on the declassification of the Netaji Subhash Chandra Bose files by the Indian Government.

Data and Resources:
The data will contain:
-A few thousand tweets covering various topics. Note that only the text of the tweets will be provided, without any information about who posted the tweets, etc.
-A set of topics in TREC format, each containing a title, a brief description, and a more detailed narrative on what type of tweets will be considered relevant to the topic.
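
As a rough illustration of the retrieval setup, the sketch below ranks tweets against a topic title with TF-IDF cosine similarity. The file name (tweets.txt), the hard-coded query, and the vectorizer settings are assumptions made only for illustration; the actual topics must be parsed from the released TREC-format files, and participants may use any retrieval model.

# Minimal code-mixed tweet ranking sketch (Python); inputs and model are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open("tweets.txt", encoding="utf-8") as f:   # hypothetical file: one tweet text per line
    tweets = [line.strip() for line in f if line.strip()]

query = "netaji ki files"                          # topic title from the example above

# Character n-grams help match transliteration variants such as "ki"/"ke"
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
doc_matrix = vectorizer.fit_transform(tweets)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = scores.argsort()[::-1]                   # indices of tweets, best match first

for rank, idx in enumerate(ranking[:10], start=1):
    print(rank, round(float(scores[idx]), 3), tweets[idx])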

How to Participate

  • Who can participate: The shared task is open to all. Students, faculty members, researchers, and engineers from industry are all welcome to participate. Participation is in teams, where a team can consist of one or more members. There is no upper limit on the number of members in a team (though we believe 2 to 4 members is the optimal team size for these tasks).
  • Registration: It is mandatory for a team to register for this shared task in order to participate. The test and training data will be sent through email only to registered teams. Click here to register.
  • Which subtasks: A team can choose to participate in either one or both of the subtasks.
  • How many runs: A team can submit up to three runs per subtask. A "run" is defined as an output for the test set from a particular system. If you want to try out more than one system on our test data (perhaps because you are not sure which system will perform best, or you are curious how slightly different systems that you have built compare), you can submit multiple runs (up to 3).


Important Dates

  • Registration for the task begins: 20th July 2016
  • Training/Dev data release: 15th August 2016
  • Registration closes: 18th August 2016
  • Test Set release: 10th September, 17:00 Hrs IST
  • Submit Run: 12th September, 17:00 Hrs IST
  • Results distributed: TBA
  • Working Notes submission deadline: TBA
  • Working Notes reviews: TBA
  • Working Notes final versions due: TBA
  • FIRE Workshop: 8-10th Dec 2016


References

  • Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. 2014. Query Expansion for Mixed-script Information Retrieval. In: The 37th Annual ACM SIGIR Conference, SIGIR-2014, Gold Coast, Australia, June 6-11, pp. 677-686.
  • Xin Li and Dan Roth. 2002. Learning Question Classifiers. In: 19th International Conference on Computational Linguistics (COLING), pp. 556–562.
  • Somnath Banerjee, Sudip Kumar Naskar, Paolo Rosso, and Sivaji Bandyopadhyay. 2016. The First Cross-Script Code-Mixed Question Answering Corpus. In: Modeling, Learning and Mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR), pp. 56-65.
  • Royal Sequiera, Monojit Choudhury, Parth Gupta, Paolo Rosso, Shubham Kumar, Somnath Banerjee, Sudip Kumar Naskar, Sivaji Bandyopadhyay, Gokul Chittaranjan, Amitava Das, and Kunal Chakma. 2015. Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval. In: Forum for Information Retrieval Evaluation (FIRE), pp. 19-25.
  • Björn Gambäck and Amitava Das. 2016. Comparing the Level of Code-Switching in Corpora. In: 10th Edition of the Language Resources and Evaluation Conference (LREC), 23-28 May 2016, Portorož, Slovenia.
  • Anupam Jamatia, Björn Gambäck, and Amitava Das. 2016. Collecting and Annotating Indian Social Media Code-Mixed Corpora. In: 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING), April 3–9, Konya, Turkey.
  • Kunal Chakma and Amitava Das. 2016. CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets. In: 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING), April 3–9, Konya, Turkey.
  • Anupam Jamatia, Björn Gambäck, and Amitava Das. 2015. Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages. In: 10th Recent Advances in Natural Language Processing (RANLP), September, pp. 239–248.
 
 

 

 

News

19/6/2016: Registration for the shared task is now open. Please register your team through this link.

Contact


Task Coordinators

  • Monojit Choudhury, Microsoft Research

  • Somnath Banerjee, Jadavpur University

  • Sudip Kumar Naskar, Jadavpur University

  • Paolo Rosso, Technical University of Valencia

  • Sivaji Bandyopadhyay, Jadavpur University

  • Amitava Das, IIIT Sriharikota

  • Kunal Chakma, NIT Agartala


Useful Links