Helix

A PHP library for counting short DNA sequences for use in Bioinformatics. Helix consists of tools for data extraction as well as an ultra-low memory hash table called DNA Hash specialized for counting DNA sequences. DNA Hash stores sequence counts by their up2bit encoding - a two-way hash that exploits the fact that each DNA base need only 2 bits to be fully encoded. Accordingly, DNA Hash uses less memory than a lookup table that stores raw gene sequences. In addition, DNA Hash's novel layered Bloom filter eliminates the need to explicitly store counts for sequences that have only been seen once.

Ultra-low memory footprint
Compatible with FASTA and FASTQ formats
Supports canonical sequence counting
Open-source and free to use commercially

Note: The maximum sequence length is platform dependent. On a 64-bit machine, the max length is 31. On a 32-bit machine, the max length is 15.

Note: Due to the probabilistic nature of the Bloom filter, DNA Hash may over count sequences at a bounded rate.

Installation

Install into your project using Composer:

$ composer require andrewdalpino/helix

Requirements

PHP 7.4 or above

Example

use Helix\DNAHash;
use Helix\Extractors\FASTA;
use Helix\Tokenizers\Canonical;
use Helix\Tokenizers\Kmer;

$extractor = new FASTA('example.fa');

$tokenizer = new Canonical(new Kmer(25));

$hashTable = new DNAHash(0.001);

foreach ($extractor as $sequence) {
    $tokens = $tokenizer->tokenize($sequence);

    foreach ($tokens as $token) {
        $hashTable->increment($token);
    }
}

$top10 = $hashTable->top(10);

print_r($top10);

Array
(
    [GCTATAAAAAGAAAATTTTGGAATA] => 19
    [ATTCCAAAATTTTCTTTTTATAGCC] => 19
    [TAAAAAGAAAATTTTGGAATAAAAA] => 18
    [ATAAAAAGAAAATTTTGGAATAAAA] => 18
    [TATAAAAAGAAAATTTTGGAATAAA] => 18
    [CTATAAAAAGAAAATTTTGGAATAA] => 18
    [AAATAATTTCAATTTTCTATCTCAA] => 17
    [AAAATAATTTCAATTTTCTATCTCA] => 17
    [CAAAATAATTTCAATTTTCTATCTC] => 17
    [AGATAGAAAATTGAAATTATTTTGA] => 17
)

Testing

To run the unit tests:

$ composer test

Static Analysis

To run static code analysis:

$ composer analyze

Benchmarks

To run the benchmarks:

$ composer benchmark

References

[1] https://github.com/JohnLonginotto/ACGTrie/blob/master/docs/UP2BIT.md.
[2] P. Melsted et al. (2011). Efficient counting of k-mers in DNA sequences using a bloom filter.
[3] S. Deorowicz et al. (2015). KMC 2: fast and resource-frugal k-mer counting.

JSON Object Signing and Encryption library for PHP.

NAMSHI | JOSE Deprecation notice Hi there, as much as we'd like to be able to work on all of the OSS in the world, we don't actively use this library

1.7k Dec 22, 2022

A library for generating random numbers and strings

RandomLib A library for generating random numbers and strings of various strengths. This library is useful in security contexts. Install Via Composer

832 Nov 24, 2022

Fast, general Elliptic Curve Cryptography library. Supports curves used in Bitcoin, Ethereum and other cryptocurrencies (secp256k1, ed25519, ..)

Fast Elliptic Curve Cryptography in PHP Information This library is a PHP port of elliptic, a great JavaScript ECC library. Supported curve types: Sho

178 Dec 28, 2022

A multitool library offering access to recommended security related libraries, standardised implementations of security defences, and secure implementations of commonly performed tasks.

SecurityMultiTool A multitool library offering access to recommended security related libraries, standardised implementations of security defences, an

131 Oct 30, 2022

A cryptography API wrapping the Sodium library, providing a simple object interface for symmetrical and asymmetrical encryption, decryption, digital signing and message authentication.

PHP Encryption A cryptography API wrapping the Sodium library, providing a simple object interface for symmetrical and asymmetrical encryption, decryp