:spider: The elegant, progressive PHP crawler framework!

Overview

QueryList

QueryList is a simple, elegant, extensible PHP web scraper (crawler/spider), based on phpQuery.

API Documentation

Chinese Documentation (中文文档)

Features

  • The same CSS3 DOM selectors as jQuery
  • The same DOM manipulation API as jQuery
  • A generic list-crawling routine
  • A powerful HTTP request suite that makes complex requests easy: simulated login, browser spoofing, HTTP proxies, and more
  • A solution for garbled character encodings
  • Powerful content filtering using jQuery selectors
  • A highly modular, easily extensible design
  • An expressive API
  • A rich set of plugins

Through plugins you can easily implement things like:

  • Multithreaded crawling
  • Crawling JavaScript-rendered pages (PhantomJS/headless WebKit)
  • Downloading images to local storage
  • Simulating browser behavior, such as submitting forms
  • Web crawlers
  • .....

Requirements

  • PHP >= 7.1

Installation

Install via Composer:

composer require jaeger/querylist
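
Once installed, QueryList is loaded through Composer's autoloader under the QL namespace. A minimal smoke-test script (the target URL is only an illustrative example):

<?php
require 'vendor/autoload.php';

use QL\QueryList;

// Print the page title as a quick check that the library is wired up.
echo QueryList::get('https://example.com')->find('title')->text();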

Usage

DOM Traversal and Manipulation

  • Crawl all image links from GitHub
QueryList::get('https://github.com')->find('img')->attrs('src');
  • Crawl Google search results
$ql = QueryList::get('https://www.google.co.jp/search?q=QueryList');

$ql->find('title')->text(); //The page title
$ql->find('meta[name=keywords]')->content; //The page keywords

$ql->find('h3>a')->texts(); //Get a list of search results titles
$ql->find('h3>a')->attrs('href'); //Get a list of search results links

$ql->find('img')->src; //Gets the link address of the first image
$ql->find('img:eq(1)')->src; //Gets the link address of the second image
$ql->find('img')->eq(2)->src; //Gets the link address of the third image
// Loop all the images
$ql->find('img')->map(function($img){
	echo $img->alt;  //Print the alt attribute of the image
});
  • More usage
$ql->find('#head')->append('<div>Append content</div>')->find('div')->htmls();
$ql->find('.two')->children('img')->attrs('alt'); // Get the alt attributes of all img child nodes of the element with class "two"
// Loop over all child nodes of the element with class "two"
$data = $ql->find('.two')->children()->map(function ($item){
    // Use "is" to determine the node type
    if($item->is('a')){
        return $item->text();
    }elseif($item->is('img')){
        return $item->alt;
    }
});

$ql->find('a')->attr('href', 'newVal')->removeClass('className')->html('newHtml')->...
$ql->find('div > p')->add('div > ul')->filter(':has(a)')->find('p:first')->nextAll()->andSelf()->...
$ql->find('div.old')->replaceWith( $ql->find('div.new')->clone())->appendTo('.trash')->prepend('Deleted')->...

List crawl

Crawl the title and link of the Google search results list:

$data = QueryList::get('https://www.google.co.jp/search?q=QueryList')
    // Set the crawl rules
    ->rules([
        'title' => array('h3','text'),
        'link'  => array('h3>a','href')
    ])
    ->query()->getData();

print_r($data->all());

Results:

Array
(
    [0] => Array
        (
            [title] => Angular - QueryList
            [link] => https://angular.io/api/core/QueryList
        )
    [1] => Array
        (
            [title] => QueryList | @angular/core - Angularリファレンス - Web Creative Park
            [link] => http://www.webcreativepark.net/angular/querylist/
        )
    [2] => Array
        (
            [title] => QueryListにQueryを追加したり、追加されたことを感知する | TIPS ...
            [link] => http://www.webcreativepark.net/angular/querylist_query_add_subscribe/
        )
        //...
)
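
When each list item contains several fields, the range() method sets a slice selector so the rules are applied inside each matched block rather than across the whole page. A minimal sketch against the same results page (the div.g slice selector is only an illustrative assumption about the markup):

$data = QueryList::get('https://www.google.co.jp/search?q=QueryList')
    ->rules([
        'title' => ['h3', 'text'],
        'link'  => ['h3>a', 'href']
    ])
    // Slice selector: rules are evaluated inside each matched block (assumed selector)
    ->range('div.g')
    ->query()
    ->getData();

print_r($data->all());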

Encoding conversion

// Output charset: UTF-8
// Input charset: GB2312
QueryList::get('https://top.etao.com')->encoding('UTF-8','GB2312')->find('a')->texts();

// Output charset: UTF-8
// Input charset: detected automatically
QueryList::get('https://top.etao.com')->encoding('UTF-8')->find('a')->texts();
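
encoding() can also be combined with the rules-based list crawl shown above; a minimal sketch, reusing the same example site and assuming the simple text/link rules below:

$data = QueryList::get('https://top.etao.com')
    // Convert the page from GB2312 to UTF-8 before parsing
    ->encoding('UTF-8', 'GB2312')
    ->rules([
        'text' => ['a', 'text'],
        'link' => ['a', 'href']
    ])
    ->query()
    ->getData();

print_r($data->all());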

HTTP Client (GuzzleHttp)

  • Carry a cookie to access GitHub as a logged-in user
// Crawl GitHub content
$ql = QueryList::get('https://github.com','param1=testvalue&params2=somevalue',[
  'headers' => [
      // Fill in the cookie from the browser
      'Cookie' => 'SINAGLOBAL=546064; wb_cmtLike_2112031=1; wvr=6;....'
  ]
]);
//echo $ql->getHtml();
$userName = $ql->find('.header-nav-current-user>.css-truncate-target')->text();
echo $userName;
  • Use an HTTP proxy
$urlParams = ['param1' => 'testvalue','params2' => 'somevalue'];
$opts = [
	// Set the http proxy
    'proxy' => 'http://222.141.11.17:8118',
    // Set the timeout in seconds
    'timeout' => 30,
     // Fake HTTP headers
    'headers' => [
        'Referer' => 'https://querylist.cc/',
        'User-Agent' => 'testing/1.0',
        'Accept'     => 'application/json',
        'X-Foo'      => ['Bar', 'Baz'],
        'Cookie'    => 'abc=111;xxx=222'
    ]
];
$ql->get('http://httpbin.org/get',$urlParams,$opts);
// echo $ql->getHtml();
  • Simulated login
// Post login
$ql = QueryList::post('http://xxxx.com/login',[
    'username' => 'admin',
    'password' => '123456'
])->get('http://xxx.com/admin');
// Crawl pages that need to be logged in to access
$ql->get('http://xxx.com/admin/page');
//echo $ql->getHtml();

Submit forms

Login GitHub

// Get the QueryList instance
$ql = QueryList::getInstance();
// Get the login form
$form = $ql->get('https://github.com/login')->find('form');

// Fill in the GitHub username and password
$form->find('input[name=login]')->val('your github username or email');
$form->find('input[name=password]')->val('your github password');

// Serialize the form data
$formData = $form->serializeArray();
$postData = [];
foreach ($formData as $item) {
    $postData[$item['name']] = $item['value'];
}

// Submit the login form
$actionUrl = 'https://github.com'.$form->attr('action');
$ql->post($actionUrl,$postData);
// To determine whether the login is successful
// echo $ql->getHtml();
$userName = $ql->find('.header-nav-current-user>.css-truncate-target')->text();
if ($userName) {
    echo 'Login successful! Welcome: '.$userName;
} else {
    echo 'Login failed!';
}

Bind function extension

Customize an extension by binding a myHttp method:

$ql = QueryList::getInstance();

//Bind a `myHttp` method to the QueryList object
$ql->bind('myHttp',function ($url){
	// $this is the current QueryList object
    $html = file_get_contents($url);
    $this->setHtml($html);
    return $this;
});

// Then you can call it by the bound name
$data = $ql->myHttp('https://toutiao.io')->find('h3 a')->texts();
print_r($data->all());

Or package it into a class, and then bind it:

$ql->bind('myHttp',function ($url){
    return new MyHttp($this,$url);
});
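
The MyHttp class referenced above is not shown in the original; a minimal sketch of what such a class might look like, assuming it simply fetches the page itself and feeds the HTML back into the bound QueryList instance:

use QL\QueryList;

// Hypothetical helper class for illustration only — not part of QueryList itself.
class MyHttp
{
    protected $ql;

    public function __construct(QueryList $ql, $url)
    {
        // Deliberately simple fetch; a real implementation might use Guzzle,
        // custom headers, encoding conversion, etc.
        $html = file_get_contents($url);
        $ql->setHtml($html);
        $this->ql = $ql;
    }

    // Hand the populated QueryList back so the usual chain can continue.
    public function ql()
    {
        return $this->ql;
    }
}

// Usage with the binding shown above:
// $data = $ql->myHttp('https://toutiao.io')->ql()->find('h3 a')->texts();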

Plugin used

  • Use the PhantomJS plugin to crawl JavaScript-rendered pages:
// Set the PhantomJS binary path when installing the plugin
$ql = QueryList::use(PhantomJs::class,'/usr/local/bin/phantomjs');

// Crawl all image links from 500px
$data = $ql->browser('https://500px.com/editors')->find('img')->attrs('src');
print_r($data->all());

// Use the HTTP proxy
$ql->browser('https://500px.com/editors',false,[
	'--proxy' => '192.168.1.42:8080',
    '--proxy-type' => 'http'
]);
  • Use the cURL multi-threading plugin to crawl GitHub Trending with multiple threads:
$ql = QueryList::use(CurlMulti::class);
$ql->curlMulti([
    'https://github.com/trending/php',
    'https://github.com/trending/go',
    //.....more urls
])
 // Called when a task succeeds
 ->success(function (QueryList $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->find('h3 a')->texts();
    print_r($data->all());
})
 // Called when a task fails
->error(function ($errorInfo,CurlMulti $curl){
    echo "Current url:{$errorInfo['info']['url']} \r\n";
    print_r($errorInfo['error']);
})
->start([
	// Maximum number of threads
    'maxThread' => 10,
    // Number of error retries
    'maxTry' => 3,
]);

Plugins

View more QueryList plugins and QueryList-based products: QueryList Community

Contributing

Contributions to QueryList are welcome. For information about contributing plugins, see the QueryList Plugin Contributing Guide.

Author

Jaeger [email protected]

If this library is useful for you, say thanks by buying me a beer 🍺!

License

QueryList is licensed under the MIT license. See the LICENSE file for more details.

Comments
  • Error report: Document with ID '67ca6e7b6472494a20aff43a4739ab59' isn't loaded. Use phpQuery::newDocument($html) or phpQuery::newDocumentFile($file) first.

    During a long-running (unattended) crawl, the error "Document with ID '67ca6e7b6472494a20aff43a4739ab59' isn't loaded. Use phpQuery::newDocument($html) or phpQuery::newDocumentFile($file) first." appears. What causes this?

    opened by windpursuer 13
  • After updating Composer this error keeps appearing, even though the code follows the documentation

    Error 500: Internal Server Error { "message": "Argument 1 passed to QL\Dom\Query::handleData() must be an instance of Tightenco\Collect\Support\Collection, instance of Illuminate\Support\Collection given, called in /home/vagrant/code/yishang/vendor/jaeger/querylist/src/Dom/Query.php on line 142", "status_code": 500 }

    Code:

    $url = 'https://it.ithome.com/ityejie/';
    // Crawl rules
    $rules = [
        // Article title
        'title' => ['h2>a','text'],
        // Link
        'link' => ['h2>a','href'],
        // Thumbnail
        'img' => ['.list_thumbnail>img','src'],
        // Summary
        'desc' => ['.memo','text']
    ];
    // Slice selector
    $range = '.content li';
    $rt = QueryList::get($url)->rules($rules)->range($range)->query()->getData();
    print_r($rt->all());
    die();

    opened by smiaoO712 6
  • Newer Laravel versions produce many inexplicable problems

    I had already run into some inexplicable results on newer Laravel versions and assumed my own code was at fault. Today I tried Laravel 7 and hit an even stranger problem: when crawling a list, an href rule returns only the first record, while a text rule returns everything but merged into a single entry. Result on Laravel 7: (screenshot). Result on Laravel 5.8: (screenshot). In the end I copied the code into a Laravel 5.8 project and the returned data was correct. The framework is otherwise quite good to use, but problems like this are frustrating.

    Development with this framework is still quite efficient. I hope the author keeps optimizing it. Thanks!

    opened by aboutboy 5
  • List crawl returns incorrect data

    $search_url = 'http://so.iqiyi.com/so/q_' . $keyword;
    $rules = [
        //div[@class='mod_search_result']/div/ul/li[1]/h3[@class="result_title"]
        'title' => ['div>h3','text'],
        //div[@class='mod_search_result']/div/ul/li[1]/a/img/@src
        'image' => ['a>img','src']
    ];
    $range = '.mod_search_result>div>ul';
    $data = QueryList::get($search_url)->rules($rules)->range($range)->query()->getData();
    // Print results
    print_r($data->all());

    opened by navysummer 5
  • phpQuery has a bug: when the HTML contains special characters it cannot recognize, the HTML is truncated and the crawl result is wrong

    phpQuery has a bug: when the HTML contains special characters it cannot recognize, the HTML is truncated and the final crawl result is incorrect. In that case you can try using a regular expression or some other method to extract just the HTML fragment you want to crawl and pass that fragment to QueryList, which works around the problem in this scenario.

    Will this bug be fixed? Special characters and emoji cause the truncation, and it is a serious problem for my use case.

    opened by ghost 4
  • Error when converting encoding

    Version: 3.1.2
    PHP version: 5.3.27. Converting GB2312 to UTF-8.

    Error screenshot: qq 20170906123551

    Does not affect normal execution.

    Related code:

    private function _arrayConvertEncoding($arr, $toEncoding, $fromEncoding)
    {
        eval('$arr = '.iconv($fromEncoding, $toEncoding.'//IGNORE', var_export($arr,TRUE)).';');
        return $arr;
    }
    
    opened by storyflow 4
  • Version 4.2.8 is unusable

    After upgrading to the new version 4.2.8 via Composer, QueryList::get($cai_url)->rules([ 'title'=>array('h3','text'), 'link'=>array('h3>a','href') ])->query()->getData(); only returns: Array ( [title] => Angular - QueryList [link] => https://angular.io/api/core/QueryList )

    I can no longer get a multidimensional array. Rolling back to the previously used 4.0.1 works fine.

    opened by cokyhe 3
  • In some cases html()/setHtml() corrupts the document

    For example with http://xiaohua.zol.com.cn/lengxiaohua/34.html: after calling setHtml() the HTML content is damaged and one tag goes missing, so the DOM can no longer be parsed. In the damaged output, a list item that should begin with an opening tag (a joke entry titled "最爱冷笑话,开心还能练大脑", source 笑话集) has lost that tag after setHtml().

    My code snippet:

    $listql = $ql->myGet($listurl,[],$options);
    $listql->use(AbsoluteUrl::class);
    if($charset != 'utf-8'){
        $listhtml = Http::get($listurl,[],$options)->body();
        //dump($listhtml);
        $listhtml = mb_convert_encoding($listhtml,'UTF-8','GBK');
        //dump($listhtml);
        $listhtml = preg_replace('/<meta[^<>]charset[^<>]?>/i', '', $listhtml);
        dump($listhtml); // Dumping here still looks correct; the tag is present
        $listql->setHtml($listhtml);
    }
    if ($listql->getHtml() == ''){
        continue;
    }
    dump($listql); // Dumping here shows one node is missing, and the next line then errors out
    $listdata = $listql->absoluteUrl($listurl)->rules($listrule)->range($listrange)->query()->getData();

    Error log: ErrorException

    DOMXPath::query(): Invalid expression

    at vendor/jaeger/phpquery-single/phpQuery.php:1765

    1761|     ? '//*'
    1762|     : $xpath.$XQuery;
    1763| $this->debug("XPATH: {$query}");
    1764| // run query, get elements
    1765| $nodes = $this->xpath->query($query);
    1766| $this->debug("QUERY FETCHED");
    1767| if (! $nodes->length )
    1768|     $this->debug('Nothing found');
    1769| $debug = array();

    opened by aboutboy 3
  • Has anyone looked into the garbled-encoding problem?

    It seems to be a GuzzleHttp issue. http://xiaohua.zol.com.cn/ comes back garbled no matter what; for now I can only use file_get_contents plus manual encoding conversion. But I want to write a general-purpose crawler, and if special cases keep coming up it is hard to build it on this framework.

    It is painful. Has anyone dug into this before? How can GuzzleHttp be modified directly to fix the garbled output permanently?

    opened by aboutboy 3
  • Not sure why, but crawled elements all get what looks like a space character prepended

    For example, "title" => " 想念的星星不说话" — the title gets an extra leading character. The odd thing is that trim() cannot remove it; I could only strip it with str_replace after copying the character, or with $x['title'] = ltrim($v['title'][0],' ');. Pasting the character into Notepad++ still does not reveal what it is. The page source itself has no leading space before the element; the original markup is class="tooltip">想念的星星不说话. My Laravel version is 5.8.

    opened by aboutboy 3
  • php 7.4 warning: Array and string offset access syntax with curly braces is deprecated

    {"exception":"[object] (ErrorException(code: 0): Array and string offset access syntax with curly braces is deprecated at /app/vendor/jaeger/phpquery-single/phpQuery.php:2170)

    opened by x-controller 3
  • Fatal error: Uncaught TypeError: Argument 1 passed to QL\Services\MultiRequestService::QL\Services\{closure}() must be an instance of GuzzleHttp\Exception\RequestException, instance of GuzzleHttp\Exception\ConnectException given

    When using QueryList::rules($rules)->multiGet I always get this error, and a try/catch cannot catch it, so the program crashes outright.

    Fatal error: Uncaught TypeError: Argument 1 passed to QL\Services\MultiRequestService::QL\Services{closure}() must be an instance of GuzzleHttp\Exception\RequestException, instance of GuzzleHttp\Exception\ConnectException given in D:\phpstudy\WWW\wpblog\wp-content\plugins\seekhub-collector\vendor\jaeger\querylist\src\Services\MultiRequestService.php:56
    Stack trace:
    #0 [internal function]: QL\Services\MultiRequestService->QL\Services{closure}(Object(GuzzleHttp\Exception\ConnectException), 7, Object(GuzzleHttp\Promise\Promise))
    #1 D:\phpstudy\WWW\wpblog\wp-content\plugins\seekhub-collector\vendor\guzzlehttp\promises\src\EachPromise.php(192): call_user_func(Object(Closure), Object(GuzzleHttp\Exception\ConnectException), 7, Object(GuzzleHttp\Promise\Promise))
    #2 D:\phpstudy\WWW\wpblog\wp-content\plugins\seekhub-collector\vendor\guzzlehttp\promises\src\Promise.php(204): GuzzleHttp\Promise\EachPromise->GuzzleHttp\Promise{closure}(Object(GuzzleHttp\Exception\ConnectException))
    #3 D:\phpstudy\WWW\wpblog\wp-content\plugins\seek in D:\phpstudy\WWW\wpblog\wp-content\plugins\seekhub-collector\vendor\jaeger\querylist\src\Services\MultiRequestService.php on line 56

    opened by jsonhet 1
  • Cannot install on PHP 8.0 / Laravel 8.5

    % composer require jaeger/querylist                          
    Using version ^4.2 for jaeger/querylist
    ./composer.json has been updated
    Running composer update jaeger/querylist
    Loading composer repositories with package information
    Updating dependencies
    Your requirements could not be resolved to an installable set of packages.
    
      Problem 1
        - jaeger/querylist[V4.2.0, ..., V4.2.8] require jaeger/g-http ^1.1 -> satisfiable by jaeger/g-http[V1.1, ..., V1.7.2].
        - jaeger/g-http V1.7.2 requires cache/filesystem-adapter ^1 -> satisfiable by cache/filesystem-adapter[1.0.0, 1.1.0, 1.1.x-dev (alias of dev-master), 1.2.0].
        - jaeger/g-http[V1.7.0, ..., V1.7.1] require cache/filesystem-adapter ^1.0 -> satisfiable by cache/filesystem-adapter[1.0.0, 1.1.0, 1.1.x-dev (alias of dev-master), 1.2.0].
        - cache/filesystem-adapter 1.1.x-dev is an alias of cache/filesystem-adapter dev-master and thus requires it to be installed too.
        - cache/filesystem-adapter[dev-master, 1.2.0] require psr/cache ^1.0 || ^2.0 -> found psr/cache[1.0.0, 1.0.1, 2.0.0] but the package is fixed to 3.0.0 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
        - cache/filesystem-adapter 1.0.0 requires php ^5.6 || ^7.0 -> your php version (8.0.11) does not satisfy that requirement.
        - cache/filesystem-adapter 1.1.0 requires psr/cache ^1.0 -> found psr/cache[1.0.0, 1.0.1] but the package is fixed to 3.0.0 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
        - jaeger/g-http[V1.1, ..., V1.6.0] require guzzlehttp/guzzle ^6.2 -> found guzzlehttp/guzzle[6.2.0, ..., 6.5.x-dev] but it conflicts with your root composer.json require (^7.0.1).
        - Root composer.json requires jaeger/querylist ^4.2 -> satisfiable by jaeger/querylist[V4.2.0, ..., V4.2.8].
    
    Use the option --with-all-dependencies (-W) to allow upgrades, downgrades and removals for packages currently locked to specific versions.
    
    Installation failed, reverting ./composer.json and ./composer.lock to their original content.
    

    Guzzle conflict: the symfony/cache and psr/cache dependency versions conflict.

    opened by xpader 1
  • Releases (V4.2.8)
    • V4.2.5(Apr 3, 2020)

    • V4.2.0(Mar 20, 2020)

      Added

      • rules adds attributes:
        • texts: get the text of multiple elements
        • htmls: get the html of multiple elements
        • htmlOuter: get the element's outer html
        • htmlOuters: get the outer html of multiple elements
      • destructDocuments(): destroy all documents
      • Elements class add htmlOuters() method

      Changed

      • destruct(): will destroy the current object
      • range: when range is not set, the data structure returned changes
      • Elements::each(): callback function parameters changed
    • V4.1.0(Dec 17, 2018)

      Added

      • postJson(): Send POST JSON Request
      • multiGet(): Concurrent GET Request
      • multiPost(): Concurrent Post Request
      • pipe(): data flow pipeline method
      • Add HTTP Cache

      Changed

      • Static calls no longer use singleton mode
    • V4.0.1(Dec 6, 2017)

      • Rewrote the entire framework
      • An expressive API
      • Fully Composer-based; manual installation is no longer supported
      • Requires PHP 7 or later
      • More modular and easier to extend
      • Built-in powerful HTTP plugin and encoding-conversion plugin
      • Almost the same DOM-manipulation API as jQuery
    • V3.1(Dec 28, 2015)
