Sitecore Search - Using JavaScript in the Document Extractor

Published: 2023-09-04

Let's say you want to specify the type of each piece of content on your website. For example, a blog post could be set to "blog" and product information could be set to "products". In this article, we introduce how to handle this with JavaScript in the Document Extractor.

Add Sources

This time, we will verify the behavior against the content of www.sitecore.com. When working on your own site, substitute your own domain and adjust for that site's data structure as you proceed.

First, create a new source.

searchsitecorecom01.png

Then open Web Crawler Settings and specify the domain.

searchsitecorecom02.png

Because of the way its search feature works, sitecore.com should not be crawled below /search. Under Exclusion patterns, set the type to Glob Expression and the value to /search.
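As a rough illustration of how a glob-style exclusion pattern filters URL paths, here is a minimal sketch. The `globToRegExp` and `isExcluded` helpers are hypothetical, not part of Sitecore Search, and whether the configured value /search also matches subpaths depends on Sitecore's glob semantics; the sketch uses /search* to make the subpath matching explicit.

```javascript
// Hedged sketch: glob-style exclusion matching (hypothetical helpers).
// Escapes regex metacharacters, then translates glob wildcards:
// "*" matches any run of characters, "?" matches a single character.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$');
}

// Returns true if the path matches any exclusion pattern.
function isExcluded(path, patterns) {
  return patterns.some((glob) => globToRegExp(glob).test(path));
}

// With an exclusion pattern of "/search*", anything under /search is skipped.
const exclusions = ['/search*'];
console.log(isExcluded('/search/results?q=cms', exclusions)); // true
console.log(isExcluded('/products/xm-cloud', exclusions));    // false
```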

searchsitecorecom06.png

Also, as described in the previous article on retrieving sitemap.xml, set the User Agent so that the site can be crawled.

searchsitecorecom03.png

For Available Locales, we will use en-us only for now.

For Triggers, this walkthrough assumes sitemap.xml is used, so we will specify sitemap.xml here as well.
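To make the sitemap trigger concrete, here is a minimal sketch of pulling page URLs out of a sitemap.xml document. The regex-based `extractSitemapUrls` helper is purely an illustration of what the trigger consumes, not Sitecore Search's actual implementation.

```javascript
// Hedged sketch: extract <loc> entries from a sitemap.xml string
// (hypothetical helper; a real parser should use an XML library).
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.sitecore.com/products/xm-cloud</loc></url>
  <url><loc>https://www.sitecore.com/company/about-us</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sitemap).length); // 2
```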

searchsitecorecom04.png

For the Document Extractor, this time we will set up JavaScript.

searchsitecorecom05.gif

For the source code, we will first run it with the default code unchanged.

JavaScript
// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    $ = response.body;

    return [{
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': $('meta[property="og:type"]').attr('content') || 'website_content',
        'url': $('meta[property="og:url"]').attr('content')
    }];
}
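The `||` chains in the default extractor evaluate left to right and return the first truthy value, so a missing meta tag (whose `.attr('content')` comes back undefined) falls through to the next selector's result. A minimal sketch of that fallback behavior, using a hypothetical `firstTruthy` helper in place of the selector calls:

```javascript
// Hedged sketch: "||" fallback chains return the first truthy value.
// firstTruthy is a hypothetical stand-in for a chain like
// $('meta[name="searchtitle"]').attr('content') || $('title').text().
function firstTruthy(...values) {
  return values.find(Boolean);
}

// Missing meta tags (undefined) fall through to later candidates.
console.log(firstTruthy(undefined, undefined, 'Page title from <title>'));
// When an earlier candidate exists, later ones are never used.
console.log(firstTruthy('og:description text', 'first <p> text'));
```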

This completes the initial setup. Once you have done this, publish the source and check that it crawls properly. After a few moments, about 1,000 content items have been added.

searchsitecorecom07.png

Document Extractor changes

In terms of retrieving and storing data from the HTML structure, the JavaScript code described above works much the same as the XPath extractor. This time, therefore, we rewrite the code so that the type is determined from the URL, as follows.

JavaScript
function extract(request, response) {
    $ = response.body;

    let url = request.url;
    let subtype;

    if (url.includes('/products/')) {
        subtype = 'Products';
    } else if (url.includes('/solutions/')) {
        subtype = 'Solutions';
    } else if (url.includes('/knowledge-center/')) {
        subtype = 'Knowledge Center';
    } else if (url.includes('/partners/')) {
        subtype = 'Partners';
    } else if (url.includes('/company/')) {
        subtype = 'Company';
    } else {
        subtype = 'website';
    }

    return [{
        'title': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'subtitle': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'description': $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content') || $('p').text(),
        'name': $('meta[name="searchtitle"]').attr('content') || $('title').text(),
        'type': subtype,
        'url': $('meta[property="og:url"]').attr('content')
    }];
}
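The if/else chain above can also be written as a lookup table, which keeps the path-to-type mapping in one place and makes it easy to add new sections later. This is an alternative sketch mirroring the article's rules, not a required change; the `pathTypes` table and `typeFromUrl` helper are our own names.

```javascript
// Hedged sketch: data-driven alternative to the if/else chain above.
// Each entry maps a URL path segment to a content type; first match wins.
const pathTypes = [
  ['/products/', 'Products'],
  ['/solutions/', 'Solutions'],
  ['/knowledge-center/', 'Knowledge Center'],
  ['/partners/', 'Partners'],
  ['/company/', 'Company'],
];

function typeFromUrl(url) {
  const hit = pathTypes.find(([segment]) => url.includes(segment));
  return hit ? hit[1] : 'website';
}

console.log(typeFromUrl('https://www.sitecore.com/products/xm-cloud')); // Products
console.log(typeFromUrl('https://www.sitecore.com/contact'));           // website
```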

Change the settings and crawl again. After a while, the content is populated with data as follows.

searchsitecorecom08.png

When the content type is added as a filter, the available values are displayed as shown below.

searchsitecorecom09.png

Summary

In this article, we used the URL to determine the content type. Relying on the data contained in og tags is effective, but older content may not include og tags, so a URL-based rule like this provides a reliable alternative.
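For pages that lack og tags, it helps to end each fallback chain with a value that is always present; for the url field, `request.url` from the extractor's input is a natural final fallback. The `resolveField` helper below is a hypothetical illustration of the pattern, not part of the Sitecore Search API.

```javascript
// Hedged sketch: resolve a field from a priority list of candidates,
// skipping missing (undefined/null) and empty values. resolveField is a
// hypothetical helper; in the extractor the same effect comes from e.g.
// $('meta[property="og:url"]').attr('content') || request.url
function resolveField(...candidates) {
  return candidates.find((v) => v != null && v !== '') ?? '';
}

// A page with no og:url still resolves via the crawled request URL.
const url = resolveField(undefined, 'https://www.sitecore.com/company/about-us');
console.log(url); // https://www.sitecore.com/company/about-us
```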
