Fact Extraction and VERification Over Unstructured and Structured information
This repository maintains the code to generate and prepare the dataset, as well as the code of the annotation platform used to generate the FEVEROUS dataset. Visit http://fever.ai to find out more about the shared task.
Install Requirements
Create a new Conda environment and install torch:
conda create -n feverous python=3.8
conda activate feverous
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 -c pytorch
Then install the package requirements specified in src/requirements.txt. Next, install the English spaCy model with python -m spacy download en_core_web_sm. The code has been tested with Python 3.7 and Python 3.8.
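For example, assuming pip is available in the activated feverous environment, both steps can be run from the repository root:
pip install -r src/requirements.txt
python -m spacy download en_core_web_sm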
Prepare Data
Call the following script to download the FEVEROUS data:
./scripts/download_data.sh
Alternatively, you can download the data directly from the FEVEROUS dataset page, namely:
- Training Data
- Development Data
- Wikipedia Data as a database (sqlite3)
After downloading the data, unpack the Wikipedia data into the same folder (i.e. data).
Reading Data
Read Annotation Data
To process annotation files, we provide a simple processing script, annotation_processor.py. The script currently does not support the use of annotator operations.
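A minimal sketch of reading the training annotations is shown below; the class name AnnotationProcessor, its module path, the accessor get_claim(), and the file name data/train.jsonl are assumptions made for illustration, so consult annotation_processor.py for the actual interface.
# Sketch only: names below are assumptions, check annotation_processor.py for the real API.
from utils.annotation_processor import AnnotationProcessor  # assumed module path

annotations = AnnotationProcessor('data/train.jsonl')  # assumed: constructed from an annotation file
for annotation in annotations:  # assumed: iterable over individual annotations
    print(annotation.get_claim())  # assumed accessor for the claim text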
Read Wikipedia Data
This repository contains elementary code to assist you in reading and processing the provided Wikipedia data. By creating a WikiPage object from the JSON data of a Wikipedia article, every element of the article is instantiated as a WikiElement, with several utility functions on top that you can then use (e.g. get an element's context, get an element by its annotation id, ...).
from database.feverous_db import FeverousDB
from utils.wiki_page import WikiPage
db = FeverousDB("path_to_the_wiki")
page_json = db.get_doc_json("Anarchism")
wiki_page = WikiPage("Anarchism", page_json)
context_sentence_14 = wiki_page.get_context('sentence_14') # Returns list of context Wiki elements
prev_elements = wiki_page.get_previous_k_elements('sentence_5', k=4) # Returns the k Wiki elements before sentence_5
next_elements = wiki_page.get_next_k_elements('sentence_5', k=4) # Returns the k Wiki elements after sentence_5
WikiElement
There are five different types of WikiElement: WikiSentence, WikiTable, WikiList, WikiSection, and WikiTitle.
A WikiElement defines/overrides four functions:
- get_ids: Returns a list of all ids in that element
- get_id: Returns the specific id of that element
- id_repr: Returns a string representation of all ids in that element
- __str__: Returns a string representation of the element's content
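As a quick illustration, these four functions can be called on any element, for instance on a table retrieved via wiki_page.get_tables() (introduced in Reading Tables below); the comments simply restate the descriptions above:
element = wiki_page.get_tables()[0]  # a WikiTable is a WikiElement (see Reading Tables below)
print(element.get_id())   # the element's own id
print(element.get_ids())  # list of all ids in that element
print(element.id_repr())  # string representation of all ids in that element
print(str(element))       # string representation of the element's content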
WikiSection additionally defines a function get_level to get the depth level of the section. WikiTable and WikiList have some additional functions, explained below.
Reading Tables
A WikiTable object takes a table from the Wikipedia data and normalizes it to column_span=1 and row_span=1. It also adds other quality-of-life features for processing the table or its rows.
wiki_tables = wiki_page.get_tables() # Returns list of all WikiTables
wiki_table_0 = wiki_tables[0]
wiki_table_0_rows = wiki_table_0.get_rows() # Returns list of WikiRows
wiki_table_0_header_rows = wiki_table_0.get_header_rows() # Returns list of WikiRows that are headers
is_header_row = wiki_table_0_rows[0].is_header_row() # Or check the row directly whether it is a header
cells_row_0 = wiki_table_0_rows[0].get_row_cells() # Returns list of WikiCells for row 0
row_representation = '|'.join([str(cell) for cell in cells_row_0]) # Get cell content separated by vertical lines
row_representation_same = str(wiki_table_0_rows[0]) # Or just stringify the row directly

# Returns the WikiTable for a cell id. Useful for retrieving the associated table for cell annotations.
table_0_cell_dict = wiki_page.get_table_from_cell_id(cells_row_0[0].get_id())
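Because the table is normalized to single-row and single-column spans, the accessors shown above are enough to linearize an entire table; for example:
# Build a simple text rendering of the whole table using only the accessors shown above.
table_text = '\n'.join(
    '|'.join(str(cell) for cell in row.get_row_cells())
    for row in wiki_table_0.get_rows()
)
print(table_text)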
Reading Lists
wiki_lists = wiki_page.get_lists() # Returns list of all WikiLists
wiki_lists_0 = wiki_lists[0]
# String representation: prefixes '-' for unordered elements and enumerations (1., 2., ...) for ordered elements
print(str(wiki_lists_0))
wiki_lists_0.get_list_by_level(0) # Returns the list elements at the given depth level
Baseline
Retriever
Our baseline retriever module is a combination of entity matching and TF-IDF using DrQA. We first extract the top k pages by matching entities extracted from the claim with Wikipedia articles. If fewer than k pages have been identified this way, the remaining pages are selected by TF-IDF matching between the introductory sentence of an article and the claim. To use TF-IDF matching, we first need to build a TF-IDF index. Run:
PYTHONPATH=src python src/baseline/retriever/build_db.py --db_path data/feverous_wikiv1.db --save_path data/feverous-wiki-docs.db
PYTHONPATH=src python src/baseline/retriever/build_tfidf.py --db_path data/feverous-wiki-docs.db --out_dir data/index/
We can now extract the top k documents:
PYTHONPATH=src python src/baseline/retriever/document_entity_tfidf_ir.py --model data/index/feverous-wiki-docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --db data/feverous-wiki-docs.db --count 5 --split dev --data_path data/
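For intuition, the entity-then-TF-IDF fallback described above can be sketched roughly as follows; this is illustrative pseudocode only, and extract_entities, match_entity_to_pages, and tfidf_rank are placeholder callables rather than functions from this repository:
def retrieve_top_k_pages(claim, k, extract_entities, match_entity_to_pages, tfidf_rank):
    # 1. Collect pages that match entities extracted from the claim.
    pages = []
    for entity in extract_entities(claim):
        for page in match_entity_to_pages(entity):
            if page not in pages:
                pages.append(page)
    # 2. If fewer than k pages were found, fill the remainder with the best
    #    TF-IDF matches between the claim and each article's introductory sentence.
    if len(pages) < k:
        for page in tfidf_rank(claim):
            if len(pages) >= k:
                break
            if page not in pages:
                pages.append(page)
    return pages[:k]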
The top l sentences and q tables of the selected pages are then scored separately using TF-IDF. We set l=5 and q=3.
PYTHONPATH=src python src/baseline/retriever/sentence_tfidf_drqa.py --db data/feverous_wikiv1.db --split dev --max_page 5 --max_sent 5 --use_precomputed false --data_path data/
PYTHONPATH=src python src/baseline/retriever/table_tfidf_drqa.py --db data/feverous_wikiv1.db --split dev --max_page 5 --max_tabs 3 --use_precomputed false --data_path data/
Combine both retrieved sentences and tables into one file:
PYTHONPATH=src python src/baseline/retriever/combine_retrieval.py --data_path data --max_page 5 --max_sent 5 --max_tabs 3 --split dev
For the next steps, we employ pre-trained transformers. You can either train these yourself (cf. the Training section) or download our pre-trained models directly (we recommend training the models yourself, as the versions used in the paper were not trained on the full training set). The cell extraction model can be downloaded here. Extract the model and place it into the folder models.
To extract relevant cells from extracted tables, run:
PYTHONPATH=src python src/baseline/retriever/predict_cells_from_table.py --input_path data/dev.combined.not_precomputed.p5.s5.t3.jsonl --max_sent 5 --wiki_path data/feverous_wikiv1.db --model_path models/feverous_cell_extractor
Verdict Prediction
To predict the verdict, either download our fine-tuned model here or train it yourself (cf. Training). Again, we recommend training the model yourself, as the model used in the paper was not trained on the full training set. Then run:
PYTHONPATH=src python src/baseline/predictor/evaluate_verdict_predictor.py --input_path data/dev.combined.not_precomputed.p5.s5.t3.cells.jsonl --wiki_path data/feverous_wikiv1.db --model_path models/feverous_verdict_predictor
Training
TBA
Evaluation
To evaluate your generated predictions locally, simply run the file evaluate.py as follows:
python evaluation/evaluate.py --input_path data/dev.combined.not_precomputed.p5.s5.t3.cells.verdict.jsonl
Note that any input file needs to define the fields label, predicted_label, evidence, and predicted_evidence in the format specified in the file feverous_scorer.
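For orientation only, a single prediction line might look roughly like the Python dict below; the label names, the element-id format, and the exact nesting of the evidence fields are assumptions made for illustration, so defer to feverous_scorer for the authoritative schema.
# Illustrative only; feverous_scorer defines the authoritative schema.
example_line = {
    "label": "SUPPORTS",            # gold verdict (assumed label set: SUPPORTS / REFUTES / NOT ENOUGH INFO)
    "predicted_label": "SUPPORTS",  # your system's verdict
    "evidence": [                   # gold evidence sets (structure assumed)
        {"content": ["Anarchism_sentence_0", "Anarchism_cell_0_1_2"]}
    ],
    "predicted_evidence": [         # your system's retrieved evidence elements (format assumed)
        "Anarchism_sentence_0", "Anarchism_cell_0_1_2"
    ],
}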
Shared Task submission
TBA