FIT5166 Information Retrieval Systems Practical Assignment

    FIT5166 Information Retrieval Systems Practical Assignment – Semester 2 2018

    Your task is to write an information retrieval engine, which will be able to index a collection of documents, and in response to a keyword query, retrieve matching documents. The information retrieval model your program will use is the vector-space model.
    You must follow all of the instructions below:
    SEPARATE SUBMISSIONS ARE REQUIRED FOR THE CREDIT LEVEL ASSIGNMENT AND THE HIGH-DISTINCTION LEVEL ASSIGNMENT (IF ATTEMPTING THE HIGH-DISTINCTION LEVEL).

    I. DUE DATE
    • Due date: Saturday 6/10/2018, 5pm
    • Language to be used: Python
    • No packages may be used for tokenization apart from regular expressions (regex), and packages related to search engines cannot be used
    • The main tasks are:
    o Tokenisation, stopword removal, stemming, indexing, tf-idf, cosine similarity, precision and recall, explicit relevance feedback and spelling correction
    o A report, as described later in this document

    II. INSTRUCTIONS FOR THE CREDIT LEVEL ASSIGNMENT (MAXIMUM MARK 69%)

    1. Your program can be written in Java, Python or any other programming language of your choice. Note that since programming skills are a prerequisite of this unit, your tutor will not help you with the coding part of the assignment.

    2. All your programming source files must be submitted as specified in Section IV, and must all follow the standard convention of having a file extension that matches the programming language you use (e.g. .py). Do not use package statements in your code.

    3. The name of your program must be MySearchEngine (i.e. at a minimum your source code directory must contain a file called MySearchEngine.py which contains the main() method). You may split your code into multiple source files, as long as they compile to produce the final MySearchEngine.class file by issuing the command in instruction #4.

    4. It must be possible to compile your program on the server by issuing the relevant runtime command from within the source code directory e.g.

    javac *.java

    5. Your program should be able to run from the command line and send its output to standard output (except for the index referred to in instruction #6, which is to be stored as a file).

    6. Your program must be able to be invoked from the command line with the following usage/parameters:

    java MySearchEngine [command]

    where [command] is one of:

    a. index collection_dir index_dir stopwords.txt

    Index all the documents stored in collection_dir. The index so constructed should be stored in index_dir, in a file named index.txt. See instructions #7 and #8 for the prescribed tokenization/stemming rules and index format. Stopwords are contained in the file stopwords.txt, a plain text file with one stopword per line. Words listed in stopwords.txt must not be stemmed into index terms.

    for example:
    java MySearchEngine index ~/mydocs ~/myindex ~/stopwords.txt

    b. search index_dir num_docs keyword_list

    return a ranked list of the top num_docs documents that match the query specified in keyword_list. The most relevant document must appear first in the list. Note that keywords in the query are separated by white space on the command line. Refer to instruction #9 for a more detailed description of what should be returned by this command.

    for example:
    java MySearchEngine search ~/myindex 10 monash university
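
    The two commands above could be parsed in Python roughly as follows. This is only a sketch: it assumes the standard-library argparse module is acceptable for command-line handling, and the function name build_parser and the argument name stopwords_file are illustrative choices that do not come from the assignment text. With Python, the invocation would presumably be "python MySearchEngine.py [command]" rather than "java MySearchEngine [command]".

```python
import argparse


def build_parser():
    """Build a parser for the 'index' and 'search' commands (illustrative names)."""
    parser = argparse.ArgumentParser(prog="MySearchEngine")
    commands = parser.add_subparsers(dest="command", required=True)

    # index collection_dir index_dir stopwords.txt
    index_cmd = commands.add_parser("index", help="build index.txt from a collection")
    index_cmd.add_argument("collection_dir")
    index_cmd.add_argument("index_dir")
    index_cmd.add_argument("stopwords_file")

    # search index_dir num_docs keyword_list
    search_cmd = commands.add_parser("search", help="rank documents for a query")
    search_cmd.add_argument("index_dir")
    search_cmd.add_argument("num_docs", type=int)
    # Keywords are separated by whitespace on the command line, so collect them all.
    search_cmd.add_argument("keyword_list", nargs="+")
    return parser


# Example: the same arguments as the search example above.
args = build_parser().parse_args(["search", "myindex", "10", "monash", "university"])
print(args.command, args.num_docs, args.keyword_list)
```

    In a real program, parse_args() would be called without an argument list so that it reads the actual command line.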

    7. When indexing documents, your program must first perform appropriate tokenization and stemming on the source document content.

    You can assume the source documents will be English language and in plaintext. Tokenization of the documents must follow these rules:

    a. Any words hyphenated across a line break must be joined into a single token (with the final token not containing the hyphen).
    b. Email addresses, web URLs and IP addresses must be preserved as a single token.
    c. Text within single quotation marks or inverted commas (e.g. ‘Word Press’) should be placed in a single token.
    d. Two or more words separated by whitespace, all of which begin with a capital letter, must be preserved as a single token (i.e. include the whitespace in the token).
    e. Acronyms should be preserved as a single token, with or without full stops (periods) (e.g. C.A.T can result in CAT or C.A.T).
    f. For all other text, split the text into tokens using as delimiters either whitespace or elements of the following subset of punctuation: {.,:;”’()?!}

    (note this set includes the braces themselves).
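
    One possible way to approach rules a to f with the re module alone is a single pattern whose alternatives are tried in priority order. The sketch below is illustrative, not a reference solution: the names TOKEN_RE and tokenize are made up here, the email/URL/IP patterns are deliberately simple, straight quotes and opening curly quotes are treated as delimiters as an extra assumption, and acronyms are normalised by dropping their full stops (one of the two outcomes rule e allows). Stopword removal and stemming would follow as separate steps.

```python
import re

# Alternatives are tried left to right, so the special cases come before plain words.
TOKEN_RE = re.compile(r"""
    (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)              # rule b: email addresses
  | (?P<url>(?:https?://|www\.)\S*[^\s.,:;”’()?!{}])     # rule b: URLs, no trailing punctuation
  | (?P<ip>\b\d{1,3}(?:\.\d{1,3}){3}\b)                  # rule b: IPv4 addresses
  | (?P<quoted>(?<!\w)['‘][^'‘’\n]+['’](?!\w))           # rule c: text in single quotes
  | (?P<acronym>\b(?:[A-Z]\.)+[A-Z]\b)                   # rule e: acronyms such as C.A.T
  | (?P<caps>\b[A-Z]\w*(?:\s+[A-Z]\w*)+)                 # rule d: runs of capitalised words
  | (?P<word>[^\s.,:;"“”'‘’()?!{}]+)                     # rule f: everything else
""", re.VERBOSE)


def tokenize(text):
    """Split plain English text into tokens following rules a to f (sketch)."""
    # Rule a: join words hyphenated across a line break, dropping the hyphen.
    text = re.sub(r"(\w)-[ \t]*\r?\n\s*(\w)", r"\1\2", text)
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if match.lastgroup == "quoted":
            token = token[1:-1]            # keep the quoted text, drop the quote marks
        elif match.lastgroup == "acronym":
            token = token.replace(".", "")  # C.A.T -> CAT
        tokens.append(token)
    return tokens


sample = "Mail alice@example.com about ‘Word Press’ at Monash University (the C.A.T scan)."
print(tokenize(sample))
```

    Apostrophes inside words (for example in contractions) are only partly handled by the lookarounds in the quoted-text alternative, so real collections may need further special cases.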

    After tokenization, tokens must be stemmed into index terms using the Porter stemmer. You may use code from the following website to implement the Porter stemmer (remember to reference the website in the comments in your code):

    http://tartarus.org/martin/PorterStemmer/
    8. Each record in your index must have the following format (with fields separated by commas, lines separated by the end of line character and any non-integer quantities rounded to 3 decimal places). Inverse document frequencies should be calculated using natural log. Also, the denominator of the classical idf formula should be incremented by one to allow for query terms that do not appear in the index. Note, below, {} indicates a repeating group that must appear at least once, but the {} characters will not appear in your index.

    For example, suppose in a corpus of 10 documents, that the stemmed term cat appears twice in document d4 and once in both documents d6 and d7. Then its index entry will be:

    The document-id (doc-id) will be the simple filename of the document (i.e. the text that follows the last directory separator character in the absolute pathname of the file).
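
    The idf rule and the doc-id rule in instruction 8 can be illustrated with a short sketch. It assumes the "classical idf formula" means ln(N / df), so that incrementing the denominator gives ln(N / (1 + df)), where N is the number of documents in the corpus and df is the number of documents containing the term. The function names idf and doc_id are hypothetical, and the sketch does not reproduce the index record layout.

```python
import math
import os


def idf(num_docs, doc_freq):
    """Inverse document frequency with natural log and denominator incremented by one."""
    return math.log(num_docs / (1 + doc_freq))


def doc_id(path):
    """Simple filename: the text after the last directory separator."""
    return os.path.basename(path)


# The term 'cat' from the example: 10 documents, present in d4, d6 and d7 (df = 3).
print(round(idf(10, 3), 3))   # 0.916
# A query term that is absent from the index (df = 0) still has a finite idf.
print(round(idf(10, 0), 3))   # 2.303
print(doc_id("/home/user/mydocs/d4.txt"))
```

    Under this reading, a term that occurs in every document gets a slightly negative idf, because 1 + df exceeds N.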

    9. When used with the search parameter, your program will return a ranked list of documents (i.e. in decreasing order of cosine similarity) matching the query (as represented by the user-supplied query terms). There will be one line in your output for each returned document. The format of each line in your output must be (cosine-score rounded to 3 decimal places):
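
    The ranking step itself (not the output line format) might look like the following sketch. It assumes each document has already been reduced to a dictionary mapping index terms to tf-idf weights, and that the query is weighted in the same way. The names cosine and rank, the in-memory data layout, and the choice to skip documents with a zero score are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if query_norm == 0 or doc_norm == 0:
        return 0.0
    return dot / (query_norm * doc_norm)


def rank(query_terms, doc_vectors, idf, num_docs):
    """Return up to num_docs (doc_id, score) pairs in decreasing order of cosine score."""
    term_counts = Counter(query_terms)
    query_vec = {term: count * idf.get(term, 0.0) for term, count in term_counts.items()}
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in doc_vectors.items()]
    scored = [(doc, score) for doc, score in scored if score > 0]  # assumption: drop non-matches
    scored.sort(key=lambda pair: (-pair[1], pair[0]))              # ties broken by doc-id
    return [(doc, round(score, 3)) for doc, score in scored[:num_docs]]


# Toy data with made-up weights, only to show the call.
docs = {"d1": {"cat": 1.0}, "d2": {"cat": 1.0, "dog": 1.0}, "d3": {"dog": 2.0}}
print(rank(["cat"], docs, {"cat": 1.0, "dog": 1.0}, 10))
```

    The query terms passed to rank would need to go through the same tokenization and stemming as the documents, otherwise they will not match the index terms.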

    10. To submit the credit level assignment, follow the appropriate instructions in Section IV.

    III. INSTRUCTIONS FOR THE HIGH-DISTINCTION LEVEL ASSIGNMENT (MAX MARK 100%):

    1. Students may wish to gain further marks by extending the capability of their engine. To do so, you must first implement all of the instructions for the credit level. Remember to keep the high-distinction submission separate from the submission for the CREDIT level (refer to Section IV).

    2. Seek your tutor’s approval of how you wish to extend your program. For extensions worthy of the HD grade [1], do so by sending an email describing the additional capabilities you wish to add, with the following subject line:

    Student-id-number FIT5166 HD extension

    If your proposal is considered worthy of the HD grade should it be successfully implemented, your tutor will send you approval by email. Only then may you implement the changes in your code.

    3. Document the nature of your extensions, how they might improve the indexing and/or retrieval process and provide instructions as to how to use your program.

    4. To submit the high-distinction level assignment, follow the appropriate instructions in Section IV.

    IV. ASSIGNMENT SUBMISSION INSTRUCTIONS

    For assignment specifications, refer to documents provided separately.

    Please follow these instructions exactly. Any amendments/clarifications will be posted on the unit website.

    Plagiarism warning:
    All assignment submissions will be put through plagiarism detection software, which automatically checks their similarity to other submissions from all years and to websites. Any plagiarism found will trigger the Faculty’s relevant procedures and may result in severe penalties, up to and including exclusion from the University.
    Make sure you properly reference any code and resources that you submit but that were produced by other people.

    [1] Generally, extensions will require modifications of both the indexing and searching components to achieve the HD grade.

    Students are to submit the CREDIT level and, if applicable, the HIGH DISTINCTION level assignment with the following details.
    Each submission is to include an Experiment Report of the tests that are conducted, the results of each test, the analysis and conclusion drawn from the experiments.
    Assignment Credit Level

    1. Before the due date/time, zip all your Python source code files and test data files into a file called FirstName-Surname-Assignment-Credit.zip.

    2. Submit the Experiment Report (if not submitting HD level) and the zip file online on Moodle.

    You will be required to attend an interview regarding your assignment submission.

    Assignment High Distinction Level

    1. Before the due date/time, zip all your Java source code files and test data files into a file called FirstName-Surname-Assignment-HD.zip.

    2. Submit the Experiment Report and the zip file online on Moodle.

    You will be required to attend an interview regarding your assignment submission for your assignment to be marked.
