Sitecore Search - PDF Handling

Published: 2023-09-08

Sitecore Search can search for documents in addition to HTML. In this article, we will show you how to target PDF files.

Check Attributes

In a previous article, we already added the following two items as Attributes while working with the SDK.

  • File Type
  • Parent_url

We will use these attributes to mark PDF files with a PDF file type.

Crawler Settings

Add a target domain

PDF files offered on sitecore.com are served from a different domain. Take the following page as an example.

The URL of the PDF file is https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z, and the file is delivered via a CDN. This domain is not included in the current crawler settings, so we need to add it first.

For this reason, add the domain to the settings of the source you are crawling.

Max Depth setting

This setting controls how many levels of links contained in a page are followed and indexed. It is described on the following page.

By default, this value is 0 when sitemap or sitemap index is selected, and 2 otherwise. Since we are using a sitemap in this case, we set this value to 1 so that the links on each page listed in the sitemap, including links to PDF files, are also followed.

searchpdf01.png

Now PDF files that are linked from crawled pages will be crawled and indexed. However, to be able to search for PDFs specifically, we add the following processing.

Add PDF processing in Document Extractor

So far, the data acquired by the crawler has been processed by a single Document Extractor, but this time, we would like to process PDF data in a different way.

First, add PDF processing as part of the Document Extractor processing. The settings to add are as follows:

  • Set the name to PDF
  • Select JavaScript for processing
  • For URLs to Match, select Glob Expression and set **/*.pdf* so that only URLs with the .pdf extension are processed.
    • When a PDF is served from Sitecore's Media Library, a key is appended after .pdf, which is why the pattern above ends with a trailing *; the trailing * is not needed if the URL simply ends in .pdf (see the sketch below the screenshot).
  • Place this processing before the already existing JS processing in the execution order
searchpdf02.png
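
To illustrate which URLs the Glob Expression is meant to catch, here is a minimal sketch. The regular expression is only a rough approximation of the **/*.pdf* glob (Sitecore Search evaluates the glob itself), and the second and third URLs are hypothetical examples.

JavaScript
// Rough regex approximation of the **/*.pdf* glob, for illustration only:
// the last path segment must contain ".pdf", optionally followed by more characters.
const pdfGlob = /\.pdf[^/]*$/;

const samples = [
  'https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z', // matches
  'https://example.com/docs/whitepaper.pdf',    // matches (hypothetical URL)
  'https://example.com/docs/whitepaper.html'    // does not match (hypothetical URL)
];
samples.forEach((url) => console.log(pdfGlob.test(url), url));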

Click on Add tagger and add the following processing as JavaScript:

JavaScript
function extract(request, response) {
    // Named HTML entities to decode, and their replacements.
    const translate_re = /&(nbsp|amp|quot|lt|gt);/g;
    const translate = { 'nbsp': ' ', 'amp': '&', 'quot': '"', 'lt': '<', 'gt': '>' };

    // Replace named and numeric HTML entities with plain characters.
    function decodeEntities(encodedString) {
        return encodedString.replace(translate_re, function(match, entity) {
            return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
            const num = parseInt(numStr, 10);
            return String.fromCharCode(num);
        });
    }

    function sanitize(text) {
        return text ? decodeEntities(String(text).trim()) : text;
    }

    const $ = response.body;
    const url = request.url;
    // Build a stable document id from the URL by replacing special characters.
    const id = url.replace(/[.:/&?=%]/g, '_');
    let title = sanitize($('title').text());
    const description = $('body').text().substring(0, 7000);

    // Body of the parent page that linked to this PDF.
    const $p = request.context.parent.response.body;

    // PDFs rarely have a usable title of their own, so fall back to the parent page title.
    if (title.length <= 4 && $p) {
        title = $p('title').text();
    }

    const parentUrl = request.context.parent.request.url;
    const type = request.context.parent.documents[0].data.type;
    const last_modified = request.context.parent.documents[0].data.last_modified;

    return [{
        'id': id,
        'file_type': 'pdf',
        'type': type,
        'last_modified': last_modified,
        'title': title,
        'description': description,
        'parent_url': parentUrl
    }];
}
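
To make the id generation above concrete, here is a minimal sketch that can be run in plain Node.js. The URL is the Canon PDF from earlier; the output shown in the comment is only indicative.

JavaScript
// Derive the document id the same way as the extractor above:
// every '.', ':', '/', '&', '?', '=' and '%' in the URL becomes '_'.
const url = 'https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z';
const id = url.replace(/[.:/&?=%]/g, '_');
console.log(id); // e.g. 'https___wwwsitecorecom_azureedge_net_-_media_sitecoresite_...'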

If the Localized checkbox is also checked and the screen looks like the one below, the standard crawl settings are complete.

searchpdf03.png

A locale is also set for PDF files, but the URL of the PDF file itself does not contain a locale. Therefore, we derive each file's locale from the URL of the parent page that links to it, the same page whose URL is stored in parent_url. The code looks like this:

JavaScript
function extract(request, response) {
    // URL of the page that links to this PDF.
    const parentUrl = request.context.parent.request.url;
    // Locale segments that appear in the site's URLs.
    const locales = ['zh-cn', 'de-de', 'ja-jp', 'da'];
    for (let idx = 0; idx < locales.length; idx++) {
        let locale = locales[idx];
        if (parentUrl.indexOf('/' + locale + '/') >= 0) {
            // The Danish pages use /da/ in the URL, but the locale is da-dk.
            if (locale == 'da') {
                locale = 'da-dk';
            }
            // Normalize to lowercase with an underscore, e.g. ja_jp.
            return locale.toLowerCase().replace('-', '_');
        }
    }
    // Default when no locale segment is found in the parent URL.
    return "en_us";
}
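
As a quick sanity check of this logic outside of Sitecore Search, the function can be called with a hand-built request object; the parent URLs below are hypothetical examples.

JavaScript
// Minimal local-test harness; the object shape mimics request.context.parent.request.url
// as used by the extractor above. For illustration only.
function fakeRequest(parentUrl) {
    return { context: { parent: { request: { url: parentUrl } } } };
}

console.log(extract(fakeRequest('https://www.sitecore.com/ja-jp/customers'), null)); // 'ja_jp'
console.log(extract(fakeRequest('https://www.sitecore.com/da/kunder'), null));       // 'da_dk'
console.log(extract(fakeRequest('https://www.sitecore.com/customers'), null));       // 'en_us'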

The screen after setting is as follows.

searchpdf04.png

Finally, we add processing for PDFs in the Request Extractor. PDF links are not followed by default, so this extractor picks up the PDF links found on each crawled page and adds them as crawl requests.

JavaScript
function extract(request, response) {
  const $ = response.body;
  // Match hrefs that end in .pdf, optionally followed by a query string.
  const regex = /.*\.pdf(?:\?.*)?$/;
  // Collect every <a> href on the page and return the PDF links as new crawl requests.
  return $('a')
    .toArray()
    .map((a) => $(a).attr('href'))
    .filter((url) => regex.test(url))
    .map((url) => ({ url }));
}
searchpdf05.png
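
The regular expression above accepts links that end in .pdf, optionally followed by a query string. A minimal sketch with illustrative hrefs (the first is the Canon PDF from earlier, the others are hypothetical):

JavaScript
const regex = /.*\.pdf(?:\?.*)?$/;
console.log(regex.test('https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z')); // true
console.log(regex.test('/files/brochure.pdf'));       // true (hypothetical href)
console.log(regex.test('/files/brochure.pdf.html'));  // false (hypothetical href)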

If the Document Extractor and Locale Extractor settings have been configured for the source, the screen will look like this:

searchpdf06.png

Click Publish to run the crawl with the new settings (this time the target content increases to 6,000 items).

Content Verification

Let's verify the crawled content. In the content list in the administration screen, filter by the target source and file type, and you will see a list of PDF files. If you check the attributes of an item, you will see that the file type is set to PDF.

searchpdf07.gif

Summary

This time we added PDF files to the crawl and verified in the content list that their file type is set to PDF.
