Beanbun is a multi-process web crawler framework written in PHP. It is built on Workerman and designed to be open and highly extensible.

Kidd Yu

Last update: Dec 19, 2022

Overview

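Introduction

Beanbun is a simple, extensible crawler framework. It supports distributed crawling and runs in either daemon mode or normal mode. Daemon mode is built on Workerman, and the downloader is built on Guzzle.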

Documentation

https://github.com/kiddyuchina/Beanbun/blob/master/docs/chs/README.md

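Features

Supports two modes, daemon and normal (daemon mode is supported on Linux servers only)
Uses Guzzle for fetching by default
Supports distributed crawling
Supports multiple queue backends, such as in-memory and Redis
Supports custom URI filtering
Supports both breadth-first and depth-first crawling
Follows the PSR-4 standard
Page crawling is split into several steps, each of which supports custom actions (such as adding a proxy or changing the user-agent)
Flexible extension mechanism that makes it easy to write plugins for the framework: custom queues, custom crawl strategies, ...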
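The difference between breadth-first and depth-first crawling, and the effect of a custom URI filter, can be illustrated with a small framework-independent sketch. It does not use any Beanbun API: `crawlOrder`, the `$links` map, and the `$noLogin` filter are hypothetical names introduced only for this illustration, and the "site" is an in-memory array rather than real HTTP requests.

```php
<?php
declare(strict_types=1);

/**
 * Walks a link graph from $seed and returns URLs in visit order.
 *
 * @param array<string, string[]> $links  Hypothetical map: URL => URLs found on that page.
 * @param string        $seed    URL to start from.
 * @param string        $order   'breadth' (FIFO frontier) or 'depth' (LIFO frontier).
 * @param callable|null $filter  Optional URI filter; return false to skip a URL.
 * @return string[]
 */
function crawlOrder(array $links, string $seed, string $order = 'breadth', ?callable $filter = null): array
{
    $frontier = [$seed];
    $seen = [$seed => true];
    $visited = [];

    while ($frontier !== []) {
        // Taking from the front (FIFO) gives breadth-first order;
        // taking from the back (LIFO) gives a depth-first style order.
        $url = $order === 'depth' ? array_pop($frontier) : array_shift($frontier);
        $visited[] = $url;

        foreach ($links[$url] ?? [] as $next) {
            if (isset($seen[$next])) {
                continue; // already queued once
            }
            if ($filter !== null && !$filter($next)) {
                continue; // rejected by the custom URI filter
            }
            $seen[$next] = true;
            $frontier[] = $next;
        }
    }

    return $visited;
}

// Tiny in-memory "site" used only for this illustration.
$links = [
    '/'       => ['/list-1', '/list-2'],
    '/list-1' => ['/a', '/b'],
    '/list-2' => ['/c', '/login'],
];
$noLogin = static function (string $url): bool {
    return $url !== '/login';
};

echo implode(' ', crawlOrder($links, '/', 'breadth', $noLogin)), PHP_EOL;
echo implode(' ', crawlOrder($links, '/', 'depth', $noLogin)), PHP_EOL;
```

With this sample data the breadth-first run visits `/ /list-1 /list-2 /a /b /c`, while the depth-first run visits `/ /list-2 /c /list-1 /b /a`; `/login` is skipped in both because the filter rejects it.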

Installation

Beanbun can be installed with Composer.

$ composer require kiddyu/beanbun

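Quick start

Create a file named start.php with the following content:

<?php
// Load Composer's autoloader so that the Beanbun classes can be found.
require_once __DIR__ . '/vendor/autoload.php';

use Beanbun\Beanbun;

$beanbun = new Beanbun;
// Seed URLs that the crawl starts from.
$beanbun->seed = [
	'http://www.950d.com/',
	'http://www.950d.com/list-1.html',
	'http://www.950d.com/list-2.html',
];
// After each page is downloaded, save its body to a file named after the MD5 of its URL.
$beanbun->afterDownloadPage = function($beanbun) {
	file_put_contents(__DIR__ . '/' . md5($beanbun->url), $beanbun->page);
};
$beanbun->start();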

Run it from the command line:

$ php start.php

You will then see the crawl log.

Plugins

beanbun-parser: data extraction plugin https://github.com/kiddyuchina/beanbun-parser

For more details, see the documentation.

Comments

Why can the Beanbun class never be found?

[root@localhost www]# ls
composer.json composer.lock vendor
[root@localhost www]# vim start.php
[root@localhost www]# php start.php
PHP Fatal error: Uncaught Error: Class 'Beanbun\Beanbun' not found in /www/start.php:3
Stack trace:
#0 {main}
thrown in /www/start.php on line 3
[root@localhost www]#

This is the result of copying start.php and running it right after installing with Composer.

opened by voocel 4
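The site returns HTTP errors, but the crawler never stops and keeps crawling forever?

When some seeds occasionally return 500 or 404, or time out, the crawler keeps retrying, then seems to crawl an empty URL, and never enters afterDownloadPage.

1. Site failures are unavoidable, but how should the crawler handle this situation correctly?
2. In afterDownloadPage, besides the page property, is it possible to get the HTTP status code, response headers, and cookies returned by the site?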
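One general way to keep a failing URL from being retried forever is to cap the number of attempts per URL and to stop early on errors that will not improve. The sketch below is framework-independent and does not use any Beanbun API: `fetchWithRetryLimit` and the `$fetch` callable (assumed to return a status code and body, and to throw `\RuntimeException` on network errors such as timeouts) are hypothetical names introduced for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Calls $fetch($url) at most $maxAttempts times, then gives up.
 *
 * $fetch is a hypothetical downloader: it returns ['status' => int, 'body' => string]
 * and may throw \RuntimeException on a network error such as a timeout.
 *
 * @return array{ok: bool, attempts: int, status: ?int, body: ?string}
 *         'status' is the last status code seen, or null if every attempt threw.
 */
function fetchWithRetryLimit(callable $fetch, string $url, int $maxAttempts = 3): array
{
    $status = null;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $response = $fetch($url);
        } catch (\RuntimeException $e) {
            continue; // network error: try again, still bounded by $maxAttempts
        }

        $status = $response['status'];
        if ($status >= 200 && $status < 300) {
            return ['ok' => true, 'attempts' => $attempt, 'status' => $status, 'body' => $response['body']];
        }
        if ($status >= 400 && $status < 500) {
            break; // client errors such as 404 will not improve on retry
        }
        // Any other status (for example 500): loop and retry.
    }

    return ['ok' => false, 'attempts' => min($attempt, $maxAttempts), 'status' => $status, 'body' => null];
}

// Fake downloader for the demo: answers 500 twice, then 200.
$calls = 0;
$flaky = function (string $url) use (&$calls): array {
    $calls++;
    return $calls < 3
        ? ['status' => 500, 'body' => '']
        : ['status' => 200, 'body' => '<html>ok</html>'];
};

$result = fetchWithRetryLimit($flaky, 'http://example.com/', 5);
echo $result['ok'] ? 'fetched' : 'gave up', ' after ', $result['attempts'], " attempt(s)\n";
```

Because the result carries the status code alongside the body, the caller can decide whether to drop the URL, log it, or requeue it later, instead of retrying without bound.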

opened by xtremforce 1
Add redis db and prefix
As the title says. You should also consider setting a timeout on the Redis connection, e.g.:

$this->redis->connect($this->config['host'], $this->config['port'], $this->config['timeout']);

thanks!
opened by taozywu 0
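About POST requests

The examples all use GET and there is no POST example, so I don't know where the data should be set. Could you give a POST example?

$beanbun->seed = [
	//'http://www.950d.com/',
	[
		'http://www.950d.com/list-2.html',
		[
			'method' => 'POST',
		]
	]
];

Following the example and setting it up like this, first, there is no POST data, and second, even so it reports an error: cURL error 3.

Please give a POST example, thanks.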

opened by googles8 0
cURL error 28: Operation timed out

cURL error 28: Operation timed out after 60001 milliseconds with 49054 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)

What causes this, and how can it be fixed?

opened by boorcode 3

Beanbun does not run when integrated into the Yii2 framework

use yii\console\Controller;
use Beanbun\Beanbun;

class DemoController extends Controller
{

    public function actionBeanbun()
    {
        $beanbun = new Beanbun();
        $beanbun->seed = [
            'http://www.950d.com/',
            'http://www.950d.com/list-1.html',
            'http://www.950d.com/list-2.html',
        ];
        $beanbun->afterDownloadPage = function ($beanbun) {
            file_put_contents(__DIR__ . '/' . md5($beanbun->url), $beanbun->page);
        };
        $beanbun->start();
    }
}

Problem 1

PHP Notice 'yii\base\ErrorException' with message 'Undefined property: Beanbun\Beanbun::$count' in /xxx/vendor/kiddyu/beanbun/src/Beanbun.php:136

Problem 2

After adding a count property to Beanbun, it still does not work.

It keeps printing

Usage: php yourfile.php {start|stop|restart|reload|status|connections} [-d]

opened by aaronlei 0

Releases (1.0.4)

1.0.4 (Jun 11, 2017)
1.0.3 (Apr 27, 2017)
1.0.2 (Apr 27, 2017)
1.0.1 (Apr 16, 2017)
1.0.0 (Apr 13, 2017)

Owner

Kidd Yu

Goutte, a simple PHP Web Scraper

Goutte, a simple PHP Web Scraper Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extrac

9.1k Jan 1, 2023

A configurable and extensible PHP web spider

Note on backwards compatibility break: since v0.5.0, Symfony EventDispatcher v3 is no longer supported and PHP Spider requires v4 or v5. If you are st

1.3k Dec 28, 2022

A browser testing and web crawling library for PHP and Symfony

A browser testing and web scraping library for PHP and Symfony Panther is a convenient standalone library to scrape websites and to run end-to-end tes

2.7k Dec 31, 2022

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

crawlerdetect.io About CrawlerDetect CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. Current

1.7k Dec 30, 2022

:spider: The progressive PHP crawler framework! 优雅的渐进式PHP采集框架。

QueryList QueryList is a simple, elegant, extensible PHP Web Scraper (crawler/spider) ,based on phpQuery. API Documentation 中文文档 Features Have the sam

2.5k Dec 27, 2022

PHP Discord Webcrawler to log all messages from a Discord Chat.

Disco the Ripper was created to rip all messages from a Discord specific channel into JSON via CLI and help people to investigate some servers who has awkward channels before they get deleted.

46 Sep 21, 2022

PHP DOM Manipulation toolkit.

phpQuery The PHP DOM Manipulation toolkit. Motivation I'm working currently with PHP, and I've missed using something like jQuery in PHP to manipulate

1 Nov 26, 2021

This Project is for digikala.com scrapping challenge of 2021 blackfriday using php/laravel/horizon

Objective This script is intended for finding the hidden treasure, A scraping challenge by digikala for 2021 black Friday Prerequisites Php mysql redi

1 Dec 22, 2021

Symfony bundle for Roach PHP

roach-php-bundle Symfony bundle for Roach PHP. Roach is a complete web scraping toolkit for PHP. It is a shameless clone heavily inspired by the popul

7 Sep 28, 2022

PHP scraper for ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

It can scrape ZEE5 Live Streaming URL's Using The Channel ID and Direct Play Anywhere

1 Mar 24, 2022

PHP library to Scrape website into entity easily

Scraper Scraper can handle multiple request type and transform them into object in order to create some API. Installation composer require rem42/scrap

5 Dec 18, 2021

Roach is a complete web scraping toolkit for PHP

Roach A complete web scraping toolkit for PHP About Roach is a complete web scraping toolkit for PHP. It is heavily inspired (read: a shameless clo

1.1k Jan 3, 2023

💫 Vega is a CLI mode HTTP web framework written in PHP support Swoole, WorkerMan / Vega 是一个用 PHP 编写的 CLI 模式 HTTP 网络框架，支持 Swoole、WorkerMan

Mix Vega 中文 | English Vega is a CLI mode HTTP web framework written in PHP support Swoole, WorkerMan Vega 是一个用 PHP 编写的 CLI 模式 HTTP 网络框架，支持 Swoole、Work

46 Apr 28, 2022

Websocket chat room written in PHP based on workerman.

High performance HTTP Service Framework for PHP based on Workerman.

Workerman Redis watcher for PHP-Casbin

A server side alternative implementation of socket.io in PHP based on workerman.

Laravel && ( Swoole || Workerman ) to get 10x faster than php-fpm

☄️ PHP CLI mode development framework, supports Swoole, WorkerMan, FPM, CLI-Server