Fact Extraction and VERification Over Unstructured and Structured information

Overview

This repository maintains the code to generate and prepare the dataset, as well as the code of the annotation platform used to create the FEVEROUS dataset. Visit http://fever.ai to find out more about the shared task.

Install Requirements

Create a new Conda environment and install torch:

conda create -n feverous python=3.8
conda activate feverous
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 -c pytorch

Then install the package requirements specified in src/requirements.txt, followed by the English spaCy model via python -m spacy download en_core_web_sm. The code has been tested with Python 3.7 and 3.8.
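
For reference, the requirements and the spaCy model can be installed from the repository root with:

pip install -r src/requirements.txt
python -m spacy download en_core_web_sm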

Prepare Data

Call the following script to download the FEVEROUS data:

./scripts/download_data.sh 

Alternatively, you can download the data directly from the FEVEROUS dataset page, namely:

  • Training Data
  • Development Data
  • Wikipedia Data as a database (sqlite3)

After downloading the data, unpack the Wikipedia data into the same folder (i.e. data).

Reading Data

Read Annotation Data

To process annotation files we provide a simple processing script annotation_processor.py. The script currently does not support the use of annotator operations.
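
A minimal usage sketch (the import path matches the repository layout under src; iterating the processor and the get_claim accessor are assumptions for illustration, so check annotation_processor.py for the exact API):

from utils.annotation_processor import AnnotationProcessor

annotations = AnnotationProcessor('data/train.jsonl') #path to a downloaded annotation file
for annotation in annotations: #assumed to yield one object per annotated claim
    print(annotation.get_claim()) #hypothetical accessor for the claim text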

Read Wikipedia Data

This repository contains elementary code to assist you in reading and processing the provided Wikipedia data. By creating a WikiPage object from the JSON data of a Wikipedia article, every element of the article is instantiated as a WikiElement, on top of which several utility functions become available (e.g. get an element's context, get an element by its annotation id, ...).

from database.feverous_db import FeverousDB
from utils.wiki_page import WikiPage

db =  FeverousDB("path_to_the_wiki")

page_json = db.get_doc_json("Anarchism")
wiki_page = WikiPage("Anarchism", page_json)

context_sentence_14 = wiki_page.get_context('sentence_14') # Returns list of context Wiki elements

prev_elements = wiki_page.get_previous_k_elements('sentence_5', k=4) #Gets the k Wiki elements before sentence_5
next_elements = wiki_page.get_next_k_elements('sentence_5', k=4) #Gets the k Wiki elements after sentence_5

WikiElement

There are five different types of WikiElement: WikiSentence, WikiTable, WikiList, WikiSection, and WikiTitle.

A WikiElement defines/overrides four functions:

  • get_ids: Returns list of all ids in that element
  • get_id: Returns the specific id of that element
  • id_repr: Returns a string representation of all ids in that element
  • __str__: Returns a string representation of the element's content

WikiSection additionally defines a function get_level to get the depth level of the section. WikiTable and WikiList have some additional functions, explained below.
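
For example, combining these functions with get_context from above (a minimal sketch that only uses functions already shown in this README):

for element in wiki_page.get_context('sentence_14'): #iterate over the context Wiki elements
    print(element.get_id()) #the id of this element
    print(element.get_ids()) #all ids contained in that element
    print(str(element)) #string representation of the element's content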

Reading Tables

A WikiTable object takes a table from the Wikipedia data and normalizes the table to column_span=1 and row_span=1. It also adds other quality-of-life features for processing the table or its rows.

wiki_tables = wiki_page.get_tables() #return list of all Wiki Tables

wiki_table_0 = wiki_tables[0]
wiki_table_0_rows = wiki_table_0.get_rows() #return list of WikiRows
wiki_table_0_header_rows = wiki_table_0.get_header_rows() #return list of WikiRows that are headers
is_header_row = wiki_table_0_rows[0].is_header_row() #or check directly whether the row is a header


cells_row_0 = wiki_table_0_rows[0].get_row_cells() #return list with WikiCells for row 0
row_representation = '|'.join([str(cell) for cell in cells_row_0]) #get cell content separated by vertical line
row_representation_same = str(cells_row_0) #or just stringify the row directly

#returns WikiTable from cell_id. Useful for retrieving the associated table for cell annotations.
table_0_cell_dict = wiki_page.get_table_from_cell_id(cells_row_0[0].get_id())

Reading Lists

wiki_lists = wiki_page.get_lists()
wiki_lists_0 = wiki_lists[0]
#String representation: Prefixes '-' for unsorted elements and enumerations (1., 2. ...) for sorted elements
print(str(wiki_lists_0))

wiki_lists[0].get_list_by_level(0) #returns list elements by level

Baseline

Retriever

Our baseline retriever module is a combination of entity matching and TF-IDF using DrQA. We first extract the top k pages by matching entities extracted from the claim with Wikipedia articles. If fewer than k pages have been identified this way, the remaining pages are selected by TF-IDF matching between the introductory sentence of an article and the claim. TF-IDF matching requires building a TF-IDF index first. Run:

PYTHONPATH=src python src/baseline/retriever/build_db.py --db_path data/feverous_wikiv1.db --save_path data/feverous-wiki-docs.db
PYTHONPATH=src python src/baseline/retriever/build_tfidf.py --db_path data/feverous-wiki-docs.db --out_dir data/index/

We can now extract the top k documents:

PYTHONPATH=src python src/baseline/retriever/document_entity_tfidf_ir.py  --model data/index/feverous-wiki-docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --db data/feverous-wiki-docs.db --count 5 --split dev --data_path data/

The top l sentences and q tables of the selected pages are then scored separately using TF-IDF. We set l=5 and q=3.

PYTHONPATH=src python src/baseline/retriever/sentence_tfidf_drqa.py --db data/feverous_wikiv1.db --split dev --max_page 5 --max_sent 5 --use_precomputed false --data_path data/
PYTHONPATH=src python src/baseline/retriever/table_tfidf_drqa.py --db data/feverous_wikiv1.db --split dev --max_page 5 --max_tabs 3 --use_precomputed false --data_path data/

Combine both retrieved sentences and tables into one file:

PYTHONPATH=src python src/baseline/retriever/combine_retrieval.py --data_path data --max_page 5 --max_sent 5 --max_tabs 3 --split dev

For the next steps, we employ pre-trained transformers. You can either train these yourself (cf. the next section) or download our pre-trained models directly (we recommend training the models yourself, as the versions used in the paper have not been trained on the full training set). The cell extraction model can be downloaded here. Extract the model and place it in the folder models.

To extract relevant cells from extracted tables, run:

PYTHONPATH=src python src/baseline/retriever/predict_cells_from_table.py --input_path data/dev.combined.not_precomputed.p5.s5.t3.jsonl --max_sent 5 --wiki_path data/feverous_wikiv1.db --model_path models/feverous_cell_extractor

Verdict Prediction

To predict the verdict, either download our fine-tuned model here or train it yourself (cf. Training). Again, we recommend training the model yourself, as the model used in the paper has not been trained on the full training set. Then run:

 PYTHONPATH=src python src/baseline/predictor/evaluate_verdict_predictor.py --input_path data/dev.combined.not_precomputed.p5.s5.t3.cells.jsonl --wiki_path data/feverous_wikiv1.db --model_path models/feverous_verdict_predictor

Training

TBA

Evaluation

To evaluate your generated predictions locally, simply run evaluate.py as follows:

python evaluation/evaluate.py --input_path data/dev.combined.not_precomputed.p5.s5.t3.cells.verdict.jsonl

Note that any input file needs to define the fields label, predicted_label, evidence, and predicted_evidence in the format specified in the file feverous_scorer.
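
For orientation only, a single prediction line might look roughly like the example below; the exact encoding of the evidence elements is an assumption here (based on the (page, type, position) triples checked by the scorer) and should be verified against feverous_scorer:

{"label": "SUPPORTS", "predicted_label": "SUPPORTS", "evidence": [...], "predicted_evidence": [["Anarchism", "sentence", "14"], ["Anarchism", "cell", "0_1_1"]]}

Here evidence carries the gold evidence sets copied from the annotation file, and predicted_evidence lists the retrieved evidence pieces.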

Shared Task submission

TBA

Comments
  • Bugfix/pkg import paths

    Saw this error running the latest version (0.0.3):

    Traceback (most recent call last):
      File "src/feverous/baseline/predictor/train_verdict_predictor.py", line 19, in <module>
        from utils.annotation_processor import AnnotationProcessor, EvidenceType
    ModuleNotFoundError: No module named 'utils'
    

    This PR fixes the above path bugs. Now that the src code is being packaged under a feverous folder, any global imports should start with "feverous." so that Python can find the submodules, etc.

    There's a mix of global and relative path imports in the code base, so I wasn't sure which one you wanted. I've gone with global imports for now, since the scripts can still be run directly that way.

    Note: This fix is required for the other PR (https://github.com/Raldir/FEVEROUS/pull/17) which I have put into draft pending this PR

    opened by creisle 8
  • KeyError on missing ID

    When I run the steps in the README I get an error in one of the downstream steps due to a missing ID field. I did a basic grep through the code and found a commented-out line that sets it. I re-added it and was able to run without issues subsequently.

    https://github.com/Raldir/FEVEROUS/blob/bfbd67460bba9b4ff12c9901d76b70e34aaf1501/src/baseline/retriever/document_entity_tfidf_ir.py#L37

    I think this needs to be uncommented?

    opened by creisle 4
  • RoBERTa+NLI model in Table 4

    Is it possible to provide the RoBERTa+NLI model as described in Table 4 of the paper? We'd like to compare against methods other than NEI Sampling.

    opened by ginaaunchat 3
  • sqlite3.OperationalError: no such table: wiki

    When I run the command PYTHONPATH=src/feverous python src/feverous/baseline/retriever/build_db.py --db_path data/feverous_wikiv1.db --save_path data/feverous-wiki-docs.db from the README to reproduce the baseline, I encounter the following error message:

    Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
    [INFO] 2022-02-21 20:06:52,952 - DrQA BuildDB - Reading into database...
    Traceback (most recent call last):
      File "src/feverous/baseline/retriever/build_db.py", line 147, in <module>
        store_contents(
      File "src/feverous/baseline/retriever/build_db.py", line 102, in store_contents
        docs = db.get_doc_ids()
      File "/path/to/FEVEROUS/src/feverous/database/feverous_db.py", line 41, in get_doc_ids
        cursor.execute("SELECT id FROM wiki")
    sqlite3.OperationalError: no such table: wiki
    

    It seems that the data I downloaded is not complete. But the data is downloaded via ./scripts/download_data.sh, which should be OK.

    Thanks for helping me!

    opened by lyy1994 3
  • Add pip/python instructions

    Summary

    • Adds instructions for running the README steps without conda (basic pip/python virtualenv)
    • Adds a setup.py file to avoid having to set the PYTHONPATH manually before each command
    • Does some linting on the README file to add new lines and code types to fenced code blocks

    PS. Not sure if you want contributions at this point but I did this for myself when I was trying out the baseline and thought it might be helpful. Feel free to decline if not.

    Note: to actually publish this on pypi you'll probably want to change the top-level namespace rather than installing from the src directory since the sub-package names are pretty general.

    If you are open to changes, I have a second PR in place that I can add later, which adds a snakemake file so users can run one command for all the README steps.

    opened by creisle 3
  • the format of evidence in feverous_scorer

    I was trying the feverous_scorer in evaluation and got confused about the format of "predicted_evidence" and "evidence".

    1. In function "check_predicted_evidence_format",
    assert all(len(prediction) == 3
                       for prediction in instance["predicted_evidence"]), \
    

    indicates that "predicted_evidence" is a set of evidence pieces, each of length 3. But the required format of evidence on http://fever.ai is a list of (at most three) evidence sets. Should "predicted_evidence" be flattened before running feverous_scorer?

    2. In function "is_strictly_correct",
    for evience_group in instance["evidence"]:
           #Filter out the annotation ids. We just want the evidence page and line number
           actual_sentences = [[e[0], e[1], e[2]] for e in evience_group]
           #Only return true if an entire group of actual sentences is in the predicted sentences
    
           if all([actual_sent in instance["predicted_evidence"] for actual_sent in actual_sentences]):
               return True
    

    indicates that the "actual_sentences" set contains evidence pieces in the format (page, type, main_position). But the "evidence" in the development set is a list of evidence sets, so actual_sentences would not retrieve evidence pieces in the format (page, type, main_position). Also, even assuming actual_sentences contained evidence pieces in the format (page, type, main_position) while the evidence pieces of "predicted_evidence" are in the format (page, type, position), it does not seem to return any true point.

    Could you clarify the format of "predicted_evidence" and "evidence" for feverous_scorer?

    opened by ginaaunchat 3
  • Querying DB with texts with special symbols does not return results.

    I've noticed that search queries with umlauts and other special symbols also return nothing from the database, while in fact there is a matching result in there.

    For example: If I try to search for Didier Lourenço using db.get_doc_json() it returns nothing. But when I try to search the database directly using SELECT * FROM wiki WHERE id = 'Didier Lourenço'; it does return the correct result from the database.

    The line under src/database/feverous_db.py to "normalize" the doc_id is commented out.

    https://github.com/Raldir/FEVEROUS/blob/main/src/database/feverous_db.py#L47-L57

    Using the "normalized" query seems to work. Any reason why this was commented out?

    opened by narayanacharya6 3
  • fail to submit the prediction

    Hi, I have produced a prediction file and wanted to obtain a result on the official evaluation website, but the submission did not work. Could you please point out any mistake in the submission format, or is it a bug of the platform? As shown in the screenshot below, the site keeps telling me that 'a valid publication URL is needed', which I am quite confused about.
    [screenshot of the submission form]

    Looking forward to your prompt reply.

    opened by Jason98Xu 2
  • Question about hardware specs

    @Raldir I am trying to run the fine-tuning training for the verdict prediction model but I keep running into CUDA memory issues. Do you remember what hardware specifications were required when you ran this?

    opened by creisle 2
  • Module "BM25_doc_ranker" not found

    When I run the command: PYTHONPATH=src python src/baseline/retriever/build_db.py --db_path data/feverous_wikiv1.db --save_path data/feverous-wiki-docs.db, I get the following error:

    Traceback (most recent call last):
      File "src/baseline/retriever/build_db.py", line 19, in <module>
        from baseline.drqa.retriever import utils
      File "/home/martin/FEVEROUS/src/baseline/drqa/retriever/__init__.py", line 24, in <module>
        from .BM25_doc_ranker import BM25DocRanker
    ModuleNotFoundError: No module named 'baseline.drqa.retriever.BM25_doc_ranker'
    

    It seems that it is looking for the module BM25_doc_ranker, but this module does not seem to exist in the repo.

    opened by Martin36 1
  • Mismatching document IDs in training data and database

    It seems that some documents have different IDs in the training data and in the database. For example, training example no. 52, with the claim and evidence as follows:

    'claim': 'Participating teams in the 2012–13 Macedonian First Football League '
              'included club Bregalnica, managed by Dobrinko Ilievski, and club '
              'Shkëndija, managed by Artim Shakiri, a retired football midfielder '
              'from North Macedonia.',
     'evidence': [{'content': ['2012–13 Macedonian First Football '
                               'League_cell_1_1_0',
                               '2012–13 Macedonian First Football '
                               'League_cell_1_1_1',
                               '2012–13 Macedonian First Football '
                               'League_cell_1_8_0',
                               '2012–13 Macedonian First Football '
                               'League_cell_1_8_1',
                               'Artim Šakiri_sentence_0'],
    

    Here one part of the evidence is from the document Artim Šakiri. But trying to get a document with this ID from the database returns None. By looking through the list of IDs in the database I found that the closest one to Artim Šakiri is Artim Å akiri.

    opened by Martin36 1
  • TypeError running cell extraction with downloaded model

    I am probably doing something incorrect here but I am not sure what. I got everything to run up to and including the combine script, src/baseline/retriever/combine_retrieval.py. I then downloaded the models linked in the README and tried to run the table cell extraction step, and I ran into this issue (see trace below):

    /projects/creisle_prj/creisle_scratch/FEVEROUS/venv/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    [INFO] 2021-09-14 12:58:40,249 - LogHelper - Log Helper set up
    [INFO] 2021-09-14 12:58:41,154 - __main__ - Start extracting cells from Tables...
    
      0%|                                                  | 0/7890 [00:00<?, ?it/s]
    100%|███████████████████████████████████| 7890/7890 [00:00<00:00, 259469.96it/s]
    Ignored unknown kwargs option trim_offsets
    Traceback (most recent call last):
      File "src/baseline/retriever/predict_cells_from_table.py", line 322, in <module>
        main()
      File "src/baseline/retriever/predict_cells_from_table.py", line 317, in main
        extract_cells_from_tables(annotations, args)
      File "src/baseline/retriever/predict_cells_from_table.py", line 260, in extract_cells_from_tables
        predictions =  (model_output.predictions > 0.25).astype(int)
    TypeError: '>' not supported between instances of 'NoneType' and 'float'
    

    Have you seen this before? Any idea what I might have done wrong?

    System specs: OS: CentOS 7; Python: 3.7; conda or pip: pip

    opened by creisle 6
  • a wrong sample in the dev dataset?

    The claim of the seventh sample in the dev set is "Per Axel Rydberg, born on July 6, 1860, in Odh, Västergötland, situated outside Sweden, was a graduate of University of Nebraska–Lincoln in the field of Botany (the science of plant life and a branch of biology).". I think this sample should be refuted, because Västergötland is in Sweden rather than outside Sweden. However, it is labeled "SUPPORTS". Does anyone have any ideas?

    good first issue Dataset 
    opened by leezythu 1