Beanbun is a multi-process web crawler framework written in PHP. It is open, highly extensible, and built on Workerman.

Overview


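Introduction

Beanbun is a simple, extensible crawler framework. It supports distributed crawling and can run in either daemon mode or normal mode. Daemon mode is based on Workerman, and the downloader is based on Guzzle.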

Documentation

https://github.com/kiddyuchina/Beanbun/blob/master/docs/chs/README.md

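Features

  • Supports daemon mode and normal mode (daemon mode is supported only on Linux servers)
  • Uses Guzzle for crawling by default
  • Supports distributed crawling
  • Supports multiple queue backends, such as in-memory and Redis
  • Supports custom URI filtering
  • Supports both breadth-first and depth-first crawling
  • Follows the PSR-4 standard
  • Crawling a page is split into several steps, each of which supports custom actions (such as adding a proxy or changing the user-agent)
  • Flexible extension mechanism that makes it easy to write plugins for the framework: custom queues, custom crawling methods, and so on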
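
The difference between breadth-first and depth-first crawling comes down to how the next URL is taken from the pending list. The following is a minimal, framework-independent sketch of that idea; `crawlOrder` and the `$links` graph are hypothetical names for illustration and are not part of Beanbun's API.

```php
<?php
// Hypothetical helper, not part of Beanbun: returns the order in which
// pages would be visited for a given in-memory link graph.
function crawlOrder(array $links, string $seed, bool $depthFirst): array
{
    $frontier = [$seed];      // URLs waiting to be visited
    $seen = [$seed => true];  // URLs already queued, to avoid revisits
    $order = [];

    while ($frontier) {
        // Depth-first takes the newest entry (stack),
        // breadth-first takes the oldest entry (queue).
        $url = $depthFirst ? array_pop($frontier) : array_shift($frontier);
        $order[] = $url;

        foreach ($links[$url] ?? [] as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $frontier[] = $next;
            }
        }
    }

    return $order;
}

// Made-up link graph: each key lists the links found on that page.
$links = [
    '/'  => ['/a', '/b'],
    '/a' => ['/a1'],
    '/b' => ['/b1'],
];

echo implode(' ', crawlOrder($links, '/', false)), PHP_EOL; // breadth-first
echo implode(' ', crawlOrder($links, '/', true)), PHP_EOL;  // depth-first
```

Breadth-first visits every page at one link distance before going deeper, while depth-first follows one branch to its end before backtracking.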

Installation

Beanbun can be installed with Composer.

$ composer require kiddyu/beanbun

Quick Start

Create a file named start.php with the following content:


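<?php
// Load Composer's autoloader so that Beanbun\Beanbun can be found.
require_once(__DIR__ . '/vendor/autoload.php');

use Beanbun\Beanbun;

$beanbun = new Beanbun;
// Seed URLs that the crawl starts from.
$beanbun->seed = [
	'http://www.950d.com/',
	'http://www.950d.com/list-1.html',
	'http://www.950d.com/list-2.html',
];
// Save each downloaded page to a file named after the MD5 hash of its URL.
$beanbun->afterDownloadPage = function($beanbun) {
	file_put_contents(__DIR__ . '/' . md5($beanbun->url), $beanbun->page);
};
$beanbun->start();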

Run it from the command line:

$ php start.php

You should then see the crawl log.
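
The afterDownloadPage callback in the quick start names each saved file after the MD5 hash of the page URL. The sketch below runs the same idea without the framework so it can be tried on its own. The `stdClass` object only stands in for the crawler instance (it models just the `url` and `page` properties used above), and the output directory and `.html` suffix are assumptions made for this example.

```php
<?php
// Stand-in for the object passed to the hook; only the two properties
// used by the quick-start callback (url, page) are modelled here.
$crawler = new stdClass();
$crawler->url = 'http://www.950d.com/list-1.html';
$crawler->page = '<html>example</html>';

// Hypothetical output directory under the system temp dir.
$outDir = sys_get_temp_dir() . '/beanbun-pages';
if (!is_dir($outDir)) {
    mkdir($outDir, 0777, true);
}

$afterDownloadPage = function ($beanbun) use ($outDir) {
    // md5() yields a fixed-length, filesystem-safe name for each URL.
    $path = $outDir . '/' . md5($beanbun->url) . '.html';
    file_put_contents($path, $beanbun->page);
    return $path;
};

$saved = $afterDownloadPage($crawler);
echo $saved, PHP_EOL;
```

Hashing the URL avoids characters such as `/` and `?` in file names, at the cost of names that are not human-readable.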

Plugins

For more details, see the documentation.

Comments
  • Why can the Beanbun class never be found?

    [root@localhost www]# ls
    composer.json composer.lock vendor
    [root@localhost www]# vim start.php
    [root@localhost www]# php start.php
    PHP Fatal error: Uncaught Error: Class 'Beanbun\Beanbun' not found in /www/start.php:3
    Stack trace:
    #0 {main}
    thrown in /www/start.php on line 3
    [root@localhost www]#

    This is the result of running a copied start.php right after installing with composer.

    opened by voocel 4
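  • The site returns HTTP errors, but the crawler does not stop and keeps crawling indefinitely?

    Some seeds occasionally return 500 or 404, or time out. The crawler keeps retrying, then seems to crawl an empty address, and never enters afterDownloadPage.

    1. Site failures are unavoidable, but how should the crawler handle them correctly?
    2. In afterDownloadPage, besides the page property, is it possible to get the HTTP status code, response headers, and cookies returned by the site?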

    opened by xtremforce 1
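
One common way to avoid endless retries is to cap the number of attempts and to stop early on errors that will not improve. The following is a minimal sketch of that pattern in plain PHP. It is not Beanbun's actual retry behaviour; `fetchWithRetry` and the `[status, body]` return shape of the downloader callback are assumptions made for this example.

```php
<?php
// Hypothetical helper: calls $fetch($url) up to $maxAttempts times and
// gives up instead of retrying forever. $fetch returns [statusCode, body].
function fetchWithRetry(callable $fetch, string $url, int $maxAttempts = 3): ?string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        [$status, $body] = $fetch($url);

        if ($status >= 200 && $status < 300) {
            return $body;
        }
        // 4xx responses (e.g. 404) will not improve on retry, so stop early.
        if ($status >= 400 && $status < 500) {
            return null;
        }
    }

    return null; // still failing (e.g. 500) after the last attempt
}

// Fake downloader for demonstration: fails twice with 500, then succeeds.
$calls = 0;
$flaky = function (string $url) use (&$calls): array {
    $calls++;
    return $calls < 3 ? [500, ''] : [200, 'ok'];
};

var_dump(fetchWithRetry($flaky, 'http://www.950d.com/'));
```

Returning null lets the caller decide whether to log the URL, skip it, or queue it again later.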
  • Add redis db and prefix

    As the title says. You should also consider setting a timeout on the Redis connection, e.g.:

    $this->redis->connect($this->config['host'], $this->config['port'], $this->config['timeout']);
    

    thanks!

    opened by taozywu 0
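
The request above (database index, key prefix, and connection timeout) could look roughly like the sketch below. It assumes the phpredis extension. The config keys `db`, `prefix`, and `timeout` and both function names are hypothetical and do not describe Beanbun's actual queue configuration. Only `redisConfig` is executed here, since `connectRedis` needs a running Redis server.

```php
<?php
// Hypothetical config keys (db, prefix, timeout) merged over defaults.
// The array union operator keeps values from $config and fills in the rest.
function redisConfig(array $config): array
{
    return $config + [
        'host'    => '127.0.0.1',
        'port'    => 6379,
        'timeout' => 2.0, // seconds; phpredis treats 0 as "no timeout"
        'db'      => 0,
        'prefix'  => '',
    ];
}

// Requires the phpredis extension and a reachable server; not called below.
function connectRedis(array $config): Redis
{
    $c = redisConfig($config);
    $redis = new Redis();
    $redis->connect($c['host'], $c['port'], $c['timeout']);
    $redis->select($c['db']);
    if ($c['prefix'] !== '') {
        // Every key used through this connection gets the prefix prepended.
        $redis->setOption(Redis::OPT_PREFIX, $c['prefix']);
    }
    return $redis;
}

print_r(redisConfig(['db' => 2, 'prefix' => 'beanbun:']));
```

A separate database index or key prefix keeps crawler queues from colliding with other data stored on the same Redis server.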
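  • Question about POST

    The examples all use GET and there is no POST example. Where should the POST data be set? Could you give a POST example?

    $beanbun->seed = [
        //'http://www.950d.com/',
        [
            'http://www.950d.com/list-2.html',
            [
                'method' => 'POST',
            ]
        ]
    ];

    Configured this way, following the example, there is no POST data, and even so it fails with cURL error 3.

    Please provide a POST example, thanks.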

    opened by googles8 0
  • cURL error 28: Operation timed out

    cURL error 28: Operation timed out after 60001 milliseconds with 49054 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)

    What causes this, and how can it be fixed?

    opened by boorcode 3
  • Beanbun fails to run when integrated into the Yii2 framework

    use yii\console\Controller;
    use Beanbun\Beanbun;
    
    class DemoController extends Controller
    {
    
        public function actionBeanbun()
        {
            $beanbun = new Beanbun();
            $beanbun->seed = [
                'http://www.950d.com/',
                'http://www.950d.com/list-1.html',
                'http://www.950d.com/list-2.html',
            ];
            $beanbun->afterDownloadPage = function ($beanbun) {
                file_put_contents(__DIR__ . '/' . md5($beanbun->url), $beanbun->page);
            };
            $beanbun->start();
        }
    }
    

    Problem 1

    PHP Notice 'yii\base\ErrorException' with message 'Undefined property: Beanbun\Beanbun::$count' in /xxx/vendor/kiddyu/beanbun/src/Beanbun.php:136

    Problem 2: after adding a count property to Beanbun, it still does not work.

    It keeps printing:

    Usage: php yourfile.php {start|stop|restart|reload|status|connections} [-d]

    opened by aaronlei 0
Releases(1.0.4)
Owner
Kidd Yu