I am trying to scrape a page that has paginated links at the bottom. In the Roach docs I found that you can override the initialRequests() method to point the spider at other URLs to scrape.
This is working as expected:
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DomCrawler\Link;

class ExampleSpider extends BasicSpider
{
    public function parseOverview(Response $response): \Generator
    {
        $pageUrls = array_map(
            function (Link $link) {
                return $link->getUri();
            },
            $response
                ->filter('.pages-items li a')
                ->links(),
        );

        foreach ($pageUrls as $pageUrl) {
            // Since we’re not specifying the second parameter,
            // all paginated pages will get handled by the
            // spider’s `parse` method.
            yield $this->request('GET', $pageUrl);
        }
    }

    public function parse(Response $response): \Generator
    {
        $items = $response->filter('.product-item')->each(function (Crawler $product, $i) {
            $array = [];

            $productName = $product->filter('.product-item-link');
            $array['product_name'] = $productName->count() ? $productName->text() : null;

            $link = $product->filter('.product-item-link');
            $array['link'] = $link->count() ? $link->link()->getUri() : null;

            $imageUrl = $product->filter('.product-image-photo');
            $array['image_url'] = $imageUrl->count() ? $imageUrl->image()->getUri() : null;

            $salePrice = $product->filter('.price-final_price .price');
            $array['sale_price'] = $salePrice->count() ? $salePrice->text() : null;

            $regularPrice = $product->filter('.old-price span.price');
            $array['regular_price'] = $regularPrice->count() ? $regularPrice->text() : null;

            $attributeSize = $product->filter('.attribute.size');
            $array['attribute_size'] = $attributeSize->count() ? $attributeSize->text() : null;

            $savings = $product->filter('.sticker-wrapper');
            $array['savings'] = $savings->count() ? $savings->text() : null;

            return $array;
        });

        foreach ($items as $item) {
            if (!$item) {
                continue;
            }

            yield $this->item($item);
        }
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        return [
            new Request(
                'GET',
                'https://www.example.com/5-pages', // Has 5 pages
                [$this, 'parseOverview']
            ),
            new Request(
                'GET',
                'https://www.example.com/1-page', // Has 1 page (no pagination)
                [$this, 'parseOverview']
            ),
        ];
    }
}
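For reference, I start the spider with Roach's standard runner, nothing custom on my end (collectSpider() would work just as well if I wanted the items back as an array):

use RoachPHP\Roach;

// Runs the spider and sends scraped items through the item pipeline.
Roach::startSpider(ExampleSpider::class);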
However, this only scrapes the pages that are gathered in the parseOverview() method. I would also like to use the $response object from the first page (https://www.example.com/5-pages) itself, and not only:
- https://www.example.com/5-pages?page=2
- https://www.example.com/5-pages?page=3
- https://www.example.com/5-pages?page=4
- https://www.example.com/5-pages?page=5
So I figured, since we already have the first page in the Response, I would try running $this->parse() on the $response object inside the parseOverview() method:
public function parseOverview(Response $response): \Generator
{
    // Here I try yielding the parse() method using the
    // response object from the first page.
    yield $this->parse($response);

    $pageUrls = array_map(
        function (Link $link) {
            return $link->getUri();
        },
        $response
            ->filter('.pages-items li a')
            ->links(),
    );

    foreach ($pageUrls as $pageUrl) {
        // Since we’re not specifying the second parameter,
        // all paginated pages will get handled by the
        // spider’s `parse` method.
        yield $this->request('GET', $pageUrl);
    }
}
However, when running the spider I get the following error: Call to undefined method Generator::value().
I also tried adding the first-page URL to the $pageUrls array, but then I get a DuplicatedRequest. That part actually makes sense, because I do not want to fire the request twice when we already have a working Response object.
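If I understand generators correctly, parse() itself returns a Generator, so yield $this->parse($response) hands Roach the Generator object instead of the results inside it, which would explain the undefined-method error. The only alternative I can think of is delegating with yield from, so that whatever parse() yields is forwarded one by one (untested sketch):

public function parseOverview(Response $response): \Generator
{
    // Delegate to parse() for the page we already have, forwarding
    // each yielded item/request instead of the Generator itself.
    yield from $this->parse($response);

    // ... then yield the pagination requests as before ...
}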
What do you recommend changing so that I also get the data from the first page?