Sitecore Search - PDF Handling

Published: 2023-09-08

Sitecore Search can search for documents in addition to HTML. In this article, we will show you how to target PDF files.

Check Attributes

In a previous article, we already added the following two items as Attributes while working with the SDK.

  • File Type
  • Parent_url

We will use these attributes to mark PDF files with a PDF file type.

Crawler Settings

Add a target domain

PDF files offered on sitecore.com are served from a different domain. Take the following page as an example.

The URL of the PDF file is https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z, and the file is delivered via a CDN. This domain is not included in the current crawler settings, so we need to add it first.

For this reason, add the domain to the settings of the source you are crawling.

Max Depth setting

This setting controls how many levels of links contained in a page are followed and indexed. It is described on the following page.

By default, this value is 0 when sitemap or sitemap index is selected, and 2 otherwise. Since we are using a sitemap in this case, we set this value to 1 so that the links on each page listed in the sitemap, including links to PDF files, are also followed.

searchpdf01.png

Now PDF files that are linked from crawled pages will be crawled and indexed. However, to be able to search for PDFs specifically, we add the following processing.

Add PDF processing in Document Extractor

So far, the data acquired by the crawler has been processed by a single Document Extractor, but this time, we would like to process PDF data in a different way.

First, add PDF processing as part of the Document Extractor processing. The settings to add are as follows:

  • Set the name to PDF
  • Select JavaScript for processing
  • For URLs to Match, select Glob Expression and set **/*.pdf* so that only URLs with the .pdf extension are processed.
    • When a PDF is served from Sitecore's Media Library, a key is appended after .pdf, which is why the pattern above ends with a trailing *; the trailing * is not needed if the URL simply ends in .pdf (see the sketch below the screenshot).
  • Place this processing before the already existing JS processing in the execution order
searchpdf02.png
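
To illustrate which URLs the Glob Expression is meant to catch, here is a minimal sketch. The regular expression is only a rough approximation of the **/*.pdf* glob (Sitecore Search evaluates the glob itself), and the second and third URLs are hypothetical examples.

JavaScript
// Rough regex approximation of the **/*.pdf* glob, for illustration only:
// the last path segment must contain ".pdf", optionally followed by more characters.
const pdfGlob = /\.pdf[^/]*$/;

const samples = [
  'https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z', // matches
  'https://example.com/docs/whitepaper.pdf',    // matches (hypothetical URL)
  'https://example.com/docs/whitepaper.html'    // does not match (hypothetical URL)
];
samples.forEach((url) => console.log(pdfGlob.test(url), url));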

Click on Add tagger and add the following processing as JavaScript:

JavaScript
function extract(request, response) {
    // Named HTML entities to decode, and their replacements.
    const translate_re = /&(nbsp|amp|quot|lt|gt);/g;
    const translate = { 'nbsp': ' ', 'amp': '&', 'quot': '"', 'lt': '<', 'gt': '>' };

    // Replace named and numeric HTML entities with plain characters.
    function decodeEntities(encodedString) {
        return encodedString.replace(translate_re, function(match, entity) {
            return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
            const num = parseInt(numStr, 10);
            return String.fromCharCode(num);
        });
    }

    function sanitize(text) {
        return text ? decodeEntities(String(text).trim()) : text;
    }

    const $ = response.body;
    const url = request.url;
    // Build a stable document id from the URL by replacing special characters.
    const id = url.replace(/[.:/&?=%]/g, '_');
    let title = sanitize($('title').text());
    const description = $('body').text().substring(0, 7000);

    // Body of the parent page that linked to this PDF.
    const $p = request.context.parent.response.body;

    // PDFs rarely have a usable title of their own, so fall back to the parent page title.
    if (title.length <= 4 && $p) {
        title = $p('title').text();
    }

    const parentUrl = request.context.parent.request.url;
    const type = request.context.parent.documents[0].data.type;
    const last_modified = request.context.parent.documents[0].data.last_modified;

    return [{
        'id': id,
        'file_type': 'pdf',
        'type': type,
        'last_modified': last_modified,
        'title': title,
        'description': description,
        'parent_url': parentUrl
    }];
}
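
To make the id generation above concrete, here is a minimal sketch that can be run in plain Node.js. The URL is the Canon PDF from earlier; the output shown in the comment is only indicative.

JavaScript
// Derive the document id the same way as the extractor above:
// every '.', ':', '/', '&', '?', '=' and '%' in the URL becomes '_'.
const url = 'https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z';
const id = url.replace(/[.:/&?=%]/g, '_');
console.log(id); // e.g. 'https___wwwsitecorecom_azureedge_net_-_media_sitecoresite_...'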

If the Localized checkbox is also checked and the screen looks like the one below, the standard crawl settings are complete.

searchpdf03.png

A locale is also set for PDF files, but the URL of the PDF file itself does not contain a locale. Therefore, we derive each file's locale from the URL of the parent page that links to it, the same page whose URL is stored in parent_url. The code looks like this:

JavaScript
function extract(request, response) {
    // URL of the page that links to this PDF.
    const parentUrl = request.context.parent.request.url;
    // Locale segments that appear in the site's URLs.
    const locales = ['zh-cn', 'de-de', 'ja-jp', 'da'];
    for (let idx = 0; idx < locales.length; idx++) {
        let locale = locales[idx];
        if (parentUrl.indexOf('/' + locale + '/') >= 0) {
            // The Danish pages use /da/ in the URL, but the locale is da-dk.
            if (locale == 'da') {
                locale = 'da-dk';
            }
            // Normalize to lowercase with an underscore, e.g. ja_jp.
            return locale.toLowerCase().replace('-', '_');
        }
    }
    // Default when no locale segment is found in the parent URL.
    return "en_us";
}
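
As a quick sanity check of this logic outside of Sitecore Search, the function can be called with a hand-built request object; the parent URLs below are hypothetical examples.

JavaScript
// Minimal local-test harness; the object shape mimics request.context.parent.request.url
// as used by the extractor above. For illustration only.
function fakeRequest(parentUrl) {
    return { context: { parent: { request: { url: parentUrl } } } };
}

console.log(extract(fakeRequest('https://www.sitecore.com/ja-jp/customers'), null)); // 'ja_jp'
console.log(extract(fakeRequest('https://www.sitecore.com/da/kunder'), null));       // 'da_dk'
console.log(extract(fakeRequest('https://www.sitecore.com/customers'), null));       // 'en_us'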

The screen after setting is as follows.

searchpdf04.png

Finally, we add processing for PDFs in the Request Extractor. PDF links are not followed by default, so this extractor picks up the PDF links found on each crawled page and adds them as crawl requests.

JavaScript
function extract(request, response) {
  const $ = response.body;
  // Match hrefs that end in .pdf, optionally followed by a query string.
  const regex = /.*\.pdf(?:\?.*)?$/;
  // Collect every <a> href on the page and return the PDF links as new crawl requests.
  return $('a')
    .toArray()
    .map((a) => $(a).attr('href'))
    .filter((url) => regex.test(url))
    .map((url) => ({ url }));
}
searchpdf05.png
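
The regular expression above accepts links that end in .pdf, optionally followed by a query string. A minimal sketch with illustrative hrefs (the first is the Canon PDF from earlier, the others are hypothetical):

JavaScript
const regex = /.*\.pdf(?:\?.*)?$/;
console.log(regex.test('https://wwwsitecorecom.azureedge.net/-/media/sitecoresite/files/home/customers/technology/canon/2020-canon-jp.pdf?md=20200622T142037Z')); // true
console.log(regex.test('/files/brochure.pdf'));       // true (hypothetical href)
console.log(regex.test('/files/brochure.pdf.html'));  // false (hypothetical href)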

If the Document Extractor and Locale Extractor settings have been configured for the source, the screen will look like this:

searchpdf06.png

Click Publish to run the crawl with the new settings (this time the target content increases to 6,000 items).

Content Verification

Let's verify the crawled content. In the content list in the administration screen, filter by the target source and file type, and you will see a list of PDF files. If you check the attributes of an item, you will see that the file type is set to PDF.

searchpdf07.gif

Summary

This time we added PDF files to the crawl and verified in the content list that their file type is set to PDF.
