Hey,
First of all, thanks for this great package!
In the docs, there is an example of how to parse a set of articles from an overview page:
    public function parse(Response $response): Generator
    {
        $links = $response->filter('header + div a')->links();

        foreach ($links as $link) {
            yield $this->request('GET', $link->getUri(), 'parseBlogPage');
        }
    }

    public function parseBlogPage(Response $response): Generator
    {
        $title = $response->filter('h1')->text();
        $publishDate = $response
            ->filter('time')
            ->attr('datetime');
        $excerpt = $response
            ->filter('.blog-content div > p:first-of-type')
            ->text();

        yield $this->item(compact('title', 'publishDate', 'excerpt'));
    }
In a use case of mine, I would like to do something similar, but split the parsing of the overview page and of an individual blog page into two separate Spiders. The Spider that finds the articles would then delegate the parsing of each blog page to another Spider. For example, I'd like to do something like this:
    use Generator;
    use RoachPHP\Http\Response;
    use RoachPHP\Spider\BasicSpider;
    use RoachPHP\Spider\Configuration\Overrides;

    class BlogOverviewSpider extends BasicSpider
    {
        public function parse(Response $response): Generator
        {
            $pages = $response
                ->filter('main > div:first-child a')
                ->links();

            foreach ($pages as $page) {
                // $this->spider() is the proposed API; it does not exist yet.
                yield $this->spider(
                    BlogPageSpider::class,
                    overrides: new Overrides(startUrls: [$page->getUri()]),
                );
            }
        }
    }

    class BlogPageSpider extends BasicSpider
    {
        public function parse(Response $response): Generator
        {
            yield $this->item([]);
        }
    }
Here's a simplified but more realistic example that shows why this would be useful.
Scraping metadata from different Git repositories
    class RepositoryOverviewSpider extends BasicSpider
    {
        public function parse(Response $response): Generator
        {
            $repositories = $response
                ->filter('main > div:first-child a')
                ->links();

            foreach ($repositories as $repository) {
                $overrides = new Overrides(startUrls: [$repository->getUri()]);

                // $this->spider() is the proposed API; it does not exist yet.
                if ($this->isGithubRepository($repository->getUri())) {
                    yield $this->spider(GithubRepositorySpider::class, overrides: $overrides);
                } elseif ($this->isGitlabRepository($repository->getUri())) {
                    yield $this->spider(GitlabRepositorySpider::class, overrides: $overrides);
                } else {
                    yield $this->spider(GenericRepositorySpider::class, overrides: $overrides);
                }
            }
        }
    }
Here, each repository Spider could define its own authentication scheme and its own specific parsing method.
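The isGithubRepository and isGitlabRepository checks are left out of the example. A minimal sketch of that dispatch, assuming the host name alone is enough to tell the providers apart, might look like the following. The function name spiderClassFor is hypothetical, and the spider classes are returned as plain strings so the snippet runs without Roach installed.

```php
<?php

declare(strict_types=1);

// Hypothetical helper (not part of Roach): picks a spider class name
// based on the host of the repository URL.
function spiderClassFor(string $uri): string
{
    // parse_url() yields null or false when there is no usable host;
    // the cast turns both into an empty string.
    $host = strtolower((string) parse_url($uri, PHP_URL_HOST));

    return match (true) {
        $host === 'github.com' || str_ends_with($host, '.github.com') => 'GithubRepositorySpider',
        $host === 'gitlab.com' || str_ends_with($host, '.gitlab.com') => 'GitlabRepositorySpider',
        // Anything else, including self-hosted instances, falls back to the generic spider.
        default => 'GenericRepositorySpider',
    };
}
```

Self-hosted GitLab instances would end up in the generic branch with this check, so a real implementation would probably need a configurable list of hosts.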
I could not find any way of using the result of another Spider in the docs. Most of the logic for starting a Spider seems to be locked behind a private API in the RoachPHP\Roach class.
Maybe I've missed something and you can already compose Spiders in some way. If not, I think it could be a great feature.
If you also see the merit in this, I could try taking a stab at implementing this myself.