Regular Expression Converter for Watson Explorer Engine

Sometimes it’s useful to extract data from a Watson Explorer content node using regular expressions. In this post, I’ll show you how to extract data using a regular expression and create a new content node for that specific data.

To start, we will use the default example-metadata collection. To keep the regex simple, we will extract any three-digit number from the snippet content. You can use much more advanced regular expressions if necessary.

First, go to the example-metadata collection and click “test-it”.

Then click on “Test-it” next to the first result:

Now scroll down and look at the output of the “Create Metadata from Content” converter:

In the output, you will see the snippet content has the number 500 in it.


We will make a converter that extracts any three-digit number into a new content node. First, add a new converter:

Select the Regex entity extraction converter and click Add.

In the converter configuration, enter “my-regex-node” in the list of entity node names and set the target node to “snippet”. Then click OK.

Now on the sidebar of WEX click the + next to XML.

Enter the names for your new XML node.

Now update the XML node to include your regular expression. Note that my regex is “[0-9]{3}”, which matches three digits. Save the node.

Return to the collection and run test-it again, as we did above, drilling down to that same first result. If you look at the conversion trace you will see the regex converter running.

Click on the 910 output to see your new content node:

Now you can use the new “regex-rule” content in your search application.

Using crawl-url nodes to enqueue data along with a url in Watson Explorer

viv:crawl-enqueue-url is documented as having one argument – the url to enqueue.
However, it has an optional two-argument form which allows much more flexibility: if the first argument is set to the url and the second is a variable containing a crawl-url node, that also works. (Note that, depending on where the XSL lives, you may need entities to create the content tags.)
For example:
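A minimal sketch of the two-argument call (the attribute and content names are examples, and exsl:node-set() is assumed to be available via xmlns:exsl="http://exslt.org/common"):

```xml
<!-- build a crawl-url node carrying extra data -->
<xsl:variable name="crawl-data">
  <crawl-url another-attribute="manually added value">
    <content name="inside">This text travels along with the url.</content>
  </crawl-url>
</xsl:variable>

<!-- first argument: the url to enqueue; second: the crawl-url node -->
<xsl:value-of select="viv:crawl-enqueue-url('http://www.someurl.com',
    exsl:node-set($crawl-data)/crawl-url)" />
```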
When www.someurl.com is crawled, the content ‘inside’ will be added to it (after running through the normalization converter, probably), and from the converter, the attributes ‘url’ (automatically added) and ‘another-attribute’ (manually added) will be available to you via viv:current-node()/@attribute-name. If you want to add multiple contents in your crawl-data node, you’ll need to give them a root node – ‘document’ will work:
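A sketch of the multiple-content variant, with the contents wrapped in a document root node:

```xml
<xsl:variable name="crawl-data">
  <crawl-url another-attribute="manually added value">
    <document>
      <content name="inside">First content.</content>
      <content name="more">Second content.</content>
    </document>
  </crawl-url>
</xsl:variable>
```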
The prototype of the two-argument form actually looks something like viv:crawl-enqueue-url(object, node). If the first argument evaluates to true, it is used as the url attribute on the node; that’s why the url is automatically added above. If it evaluates to false, the node is used as-is. Thus, an equivalent form of the above is:
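A sketch of the equivalent form; the url attribute is set on the node itself and the first argument evaluates to false:

```xml
<xsl:variable name="crawl-data">
  <crawl-url url="http://www.someurl.com" another-attribute="manually added value">
    <content name="inside">This text travels along with the url.</content>
  </crawl-url>
</xsl:variable>

<!-- false first argument: the node is used as-is, url attribute included -->
<xsl:value-of select="viv:crawl-enqueue-url(false(),
    exsl:node-set($crawl-data)/crawl-url)" />
```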

 

Watson Explorer XSL Tips and Tricks

Using the Chico Application to Test XSLT

Here’s an example use of Chico! I was testing something to do with evaluating XSLT and wanted to run it quickly.
Get to Chico by going to your Velocity installation’s velocity script and adding v.app=chico to the URL. Enter AXL in the box on the left, then submit it to see the processed results on the right.

Match &#160; (non-breaking space) within strings

Use the character reference &#160; to match non-breaking spaces within strings in XSL:
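A minimal sketch (the $text variable is hypothetical):

```xml
<!-- &#160; is the character reference for a non-breaking space -->
<xsl:if test="contains($text, '&#160;')">
  <!-- replace non-breaking spaces with regular spaces -->
  <xsl:value-of select="translate($text, '&#160;', ' ')" />
</xsl:if>
```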


For loop with XSL

You can emulate a for loop in XSL by creating a string of some length and tokenizing it:
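A sketch of the idiom, assuming the EXSLT strings extension is available (xmlns:str="http://exslt.org/strings"):

```xml
<!-- one character per desired iteration: a 5-iteration loop -->
<xsl:variable name="loop" select="substring('xxxxxxxxxxxxxxxxxxxx', 1, 5)" />

<!-- tokenizing with an empty delimiter yields one token per character -->
<xsl:for-each select="str:tokenize($loop, '')">
  <!-- position() acts as the loop counter: 1, 2, 3, 4, 5 -->
  <xsl:value-of select="position()" />
</xsl:for-each>
```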

If you intend to use this “for-loop” in a converter and you are taking data from a web page, then you must do two things to make it work (see the sketch after this list).

  • Save the entire page in an XSL variable outside the loop. Once inside the loop you will lose access to the web page.
  • If you need to iterate in your XPaths, save position() in a variable first. If you call it directly in your XPaths, it refers to the wrong context and they will not work correctly.
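Putting both caveats together (the XPath into the page is a made-up example):

```xml
<!-- caveat 1: save the page before entering the loop -->
<xsl:variable name="page" select="/" />

<xsl:for-each select="str:tokenize($loop, '')">
  <!-- caveat 2: save position() in a variable before using it in other xpaths -->
  <xsl:variable name="i" select="position()" />
  <xsl:value-of select="$page//table/tr[$i]/td[2]" />
</xsl:for-each>
```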


Get a random number within a specific range
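One way to do it, assuming the EXSLT math extension is available (xmlns:math="http://exslt.org/math"); the bounds are examples:

```xml
<xsl:variable name="min" select="1" />
<xsl:variable name="max" select="10" />

<!-- math:random() returns a number in [0, 1) -->
<xsl:variable name="rand" select="floor(math:random() * ($max - $min + 1)) + $min" />
```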


Copy a nodeset with special processing
  • Use the following XSL to copy a nodeset verbatim. Enter any special processing templates in between the comments as indicated
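A sketch of the recipe (the mode name is arbitrary; anything consistent works):

```xml
<!-- verbatim copy of elements, text, and attributes -->
<xsl:template match="node()|@*" mode="copy">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*" mode="copy" />
  </xsl:copy>
</xsl:template>

<!-- begin special processing -->
<!-- add mode="copy" templates here to override the default copy -->
<!-- end special processing -->
```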


Boost parser

  • Based on the above, this is a generic recipe for a parser you can add to a source when you want to boost its results. You’ll probably want to edit the values in uppercase. Note that we’re throwing away binning information and marking the results boost-only.
PROTIP: If you are adding this boost parser to a source which accesses a Velocity Search Engine collection, you **must** set the parser type to xsl and **not** html-xsl, or you may spend hours debugging your parser. You’ll know you made this mistake when the content nodes are empty.


Parse XML that uses an XML namespace

Define a new prefix with the xmlns attribute in each xsl:template and then prepend your new prefix to each XPath (in this case, I’ve set the namespace to the ‘a’ prefix).
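For instance (the namespace URI and element names are hypothetical):

```xml
<xsl:template match="/" xmlns:a="http://www.example.com/schema">
  <!-- every location step needs the prefix, even for a default namespace -->
  <xsl:value-of select="/a:feed/a:entry/a:title" />
</xsl:template>
```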


Parse an XML file from the command line
  • Use transform (from your checkout/vivisimo/util directory).

You can set up an alias in your .bash_profile if you don’t want to type the whole path every time.

Or you can cd to your installation directory and run it from there; the software will then find your vivisimo.conf.
  • Stub XSL for a starting point:
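A minimal stub:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />

  <xsl:template match="/">
    <!-- your processing here -->
  </xsl:template>
</xsl:stylesheet>
```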

Using xsl:key

For nodesets that are frequently accessed, creating a hash lookup with xsl:key() can significantly improve performance (in my experience, accessing a value using a key is about 0.07ms or about as fast as accessing a variable value). Here’s a simple example of creating a key and using it on the /*/settings node in the display:
For a given input XML of:
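Something like this (the setting names are examples):

```xml
<display>
  <settings>
    <setting name="results-per-page">10</setting>
    <setting name="theme">default</setting>
  </settings>
</display>
```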

You can create a key that compares against the @name of each setting node:
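A sketch (the key name is mine):

```xml
<xsl:key name="setting-by-name" match="/*/settings/setting" use="@name" />
```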


Then you can access the value of that setting node with a call to key(). The first parameter is the name of the key and the second parameter is the value to compare against @name:
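For example:

```xml
<!-- returns 10 with the sample input above -->
<xsl:value-of select="key('setting-by-name', 'results-per-page')" />
```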


Using a nodeset/result-tree as the scope and context

Normally in XSL when you execute a command, the global context for xpaths is the incoming XML and the scope is the current node.
The easiest way to change the scope is to use <xsl:apply-templates />; the new scope is whatever node the template matches on. However, the context still remains the incoming XML.
To use a completely different context (and naturally the scope), you can use xsl:for-each with a constructed nodeset.
For example:
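A reconstructed sketch, assuming exsl:node-set() (xmlns:exsl="http://exslt.org/common"):

```xml
<xsl:variable name="tree">
  <root>
    <item>1</item>
  </root>
</xsl:variable>

<xsl:for-each select="exsl:node-set($tree)">
  <!-- "/" now refers to the constructed tree, not the input document -->
  <xsl:value-of select="/root/item" />
</xsl:for-each>
```

This actually returns 1: inside the for-each, both the context and the scope are the constructed node-set, so absolute paths resolve against it rather than the incoming XML.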


This can be used with document() and AXL variables (which don’t have any context and will generate an error if you use a key() in them).

Note that running it in an exsl:node-set() context (or, for that matter, any context that you don’t really know about, i.e., anything other than / and document()) can generate weird results:
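A reconstructed sketch of the pitfall, reusing the $tree variable above:

```xml
<xsl:for-each select="exsl:node-set($tree)/root/item">
  <!-- looks like it counts nodes in the input document... -->
  <xsl:value-of select="count(/root/item)" />
</xsl:for-each>
```

This actually returns the count inside the fragment (1), no matter what the incoming XML contains, because "/" resolves against whatever tree the context node belongs to.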


Grouping – The Muenchian Method

Suppose you have the following xml:
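For instance (reconstructed sample data):

```xml
<locations>
  <location>
    <city>Pittsburgh</city>
    <state>PA</state>
  </location>
  <location>
    <city>Philadelphia</city>
    <state>PA</state>
  </location>
  <location>
    <city>Columbus</city>
    <state>OH</state>
  </location>
</locations>
```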

To list each city grouped by state, use the following xsl:
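A sketch of the technique (the output element names are mine):

```xml
<xsl:key name="location-key" match="location" use="state" />

<xsl:template match="/locations">
  <xsl:for-each select="location[generate-id(.) =
      generate-id(key('location-key', state)[1])]">
    <xsl:sort select="state" />
    <state name="{state}">
      <xsl:for-each select="key('location-key', state)">
        <xsl:sort select="city" />
        <city><xsl:value-of select="city" /></city>
      </xsl:for-each>
    </state>
  </xsl:for-each>
</xsl:template>
```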

  • The first line produces a key called location-key for each location using the value of state.
  • The outer for-each loop scans through all location nodes whose state node is the first in the group of locations with that state value. In other words, it loops one time for each state.
  • The inner for-each loop scans through all location nodes whose state node has the value of the current state node. In other words, it loops once for each city in the current state.
The above XSL will produce something like this:
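With the sample input above, the output would be:

```xml
<state name="OH">
  <city>Columbus</city>
</state>
<state name="PA">
  <city>Philadelphia</city>
  <city>Pittsburgh</city>
</state>
```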

The method described above is especially great for grouping large amounts of data as it is orders of magnitude faster than manually scanning through every node and comparing it with siblings.

One place I found a particularly crazy use for the Muenchian Method was in a converter. I wanted to output the sum of the numeric values in the 2nd column of an HTML table for groups of rows, where a group was defined as all rows having the same values in their 6th, 3rd, and 1st columns. In that case, I made a key for each row in the table using the concatenation of the 6th, 3rd, and 1st columns of the row:
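A sketch of that key, assuming a plain table of tr/td rows (the separator guards against ambiguous concatenations):

```xml
<xsl:key name="row-key" match="tr"
         use="concat(td[6], '|', td[3], '|', td[1])" />
```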

Next, I used a for-each loop to get one row from each group:
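Using the same Muenchian test as above:

```xml
<xsl:for-each select="//tr[generate-id(.) =
    generate-id(key('row-key', concat(td[6], '|', td[3], '|', td[1]))[1])]">
  <!-- one iteration per group of rows -->
</xsl:for-each>
```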


Inside that loop, I found the sum of all the numeric values in column 2 of each row in the current group:
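For example:

```xml
<!-- sum of column 2 across every row in the current group -->
<xsl:value-of select="sum(key('row-key',
    concat(td[6], '|', td[3], '|', td[1]))/td[2])" />
```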


If your head isn’t hurting yet, a more detailed explanation of the Muenchian Method is available here.


How to Enqueue Javascript Links

This is a question that comes up a lot from customers: how do I enqueue links on web pages that use Javascript?
It’s actually a pretty simple process, but a converter can’t execute Javascript the way a browser does, so the links have to be extracted by hand. Below is a sample XHTML file with a few Javascript links in it that we’d like to convert and enqueue.
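A reconstruction of such a file (the openDoc function and paths are placeholders):

```html
<html>
  <body>
    <a href="javascript:openDoc('docs/page1.html')">Page 1</a>
    <a href="javascript:openDoc('docs/page2.html')">Page 2</a>
  </body>
</html>
```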

Below is the stylesheet we would use in the converter to extract the links from this HTML file.
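A sketch built on the viv:crawl-enqueue-url() function described earlier; the base URL is a placeholder, and the extraction assumes links shaped like the sample above:

```xml
<!-- pull the quoted path out of href="javascript:openDoc('...')" -->
<xsl:template match="a[starts-with(@href, 'javascript:')]">
  <xsl:variable name="path"
      select="substring-before(substring-after(@href, &quot;('&quot;), &quot;')&quot;)" />
  <xsl:if test="$path != ''">
    <xsl:value-of select="viv:crawl-enqueue-url(
        concat('http://www.example.com/', $path))" />
  </xsl:if>
</xsl:template>
```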


This is a simple example, but it can be used as a good starting point for customers.


Directory, basename, and extension extraction

Sometimes it’s quite useful to extract the directory path, the basename, and the extension from a given path. These three templates will do that:
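A sketch of the three templates (the names are mine; all assume ‘/’ as the path separator):

```xml
<!-- directory part: everything up to and including the last '/' -->
<xsl:template name="get-dir">
  <xsl:param name="path" />
  <xsl:if test="contains($path, '/')">
    <xsl:value-of select="concat(substring-before($path, '/'), '/')" />
    <xsl:call-template name="get-dir">
      <xsl:with-param name="path" select="substring-after($path, '/')" />
    </xsl:call-template>
  </xsl:if>
</xsl:template>

<!-- basename: everything after the last '/' -->
<xsl:template name="get-basename">
  <xsl:param name="path" />
  <xsl:choose>
    <xsl:when test="contains($path, '/')">
      <xsl:call-template name="get-basename">
        <xsl:with-param name="path" select="substring-after($path, '/')" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$path" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- extension: everything after the last '.' -->
<xsl:template name="get-ext">
  <xsl:param name="path" />
  <xsl:choose>
    <xsl:when test="contains($path, '.')">
      <xsl:call-template name="get-ext">
        <xsl:with-param name="path" select="substring-after($path, '.')" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$path" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```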

 

Observe that get-basename and get-ext are basically the same thing, the only difference being the separator. The two templates could be combined into a single after-last template which takes the string and the separator as params.
Note that the behavior when an extension is not present is not defined. A better recursion termination condition would fix that.

Empty content remover

Database seeds sometimes generate empty content elements. This is bad form for the finished product. This custom XSL template will remove empty contents, i.e. contents without child nodes.
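A sketch, paired with the verbatim-copy recipe from earlier:

```xml
<!-- default: copy everything -->
<xsl:template match="node()|@*" mode="copy">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*" mode="copy" />
  </xsl:copy>
</xsl:template>

<!-- drop content elements that have no child nodes -->
<xsl:template match="content[not(node())]" mode="copy" />
```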

Parse HTML in XML

Sometimes people put HTML into RSS feeds and the like, and it needs to be parsed. This is not a straightforward task, but this code should help you out. It is taken from a custom RSS feed parser Colin Dean wrote for NIH. See VO #1024 for a little discussion and a full source parser example.
You may be inclined to follow some online tutorials that parse in two passes: the first outputs the HTML with output escaping disabled, and the second actually performs the intended parsing. One might also think to use Velocity’s secondary parser in a source to handle this. This is incorrect! It can be done in one pass with some magical viv XSL extensions.


Uglify Text for Content Name Attribute

This template first normalizes (trims and collapses) whitespace, converts spaces to dashes, lowercases the result, and strips non-alphanumeric (and non-dash) characters.
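A sketch as a named template (the template and parameter names are mine):

```xml
<xsl:template name="uglify">
  <xsl:param name="text" />
  <!-- trim/collapse whitespace, spaces to dashes, then lowercase -->
  <xsl:variable name="dashed"
      select="translate(translate(normalize-space($text), ' ', '-'),
              'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
              'abcdefghijklmnopqrstuvwxyz')" />
  <!-- double-translate trick: delete every character not in the allowed set -->
  <xsl:value-of select="translate($dashed,
      translate($dashed, 'abcdefghijklmnopqrstuvwxyz0123456789-', ''),
      '')" />
</xsl:template>
```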

 

Copying and Modifying Documents with a Watson Explorer Converter

A common task when crawling and indexing a document in Watson Explorer Engine (WEX) is making changes to the document during the conversion process. The most common case is needing to copy all the contents in the application/vxml document while making changes to one or a few of those contents. To do this, there is a recursive copy template that can be used. I’ll show you how to apply it.

First, I’m going to use the out-of-box “example-metadata” collection. Navigate to that collection and click the test-it button.

[Screenshot: the collection’s test-it button]

After clicking test-it you will see a listing of documents. Click on the test-it button for the “blowout” record.

[Screenshot: the test-it results listing]

On the resulting page, scroll down and look at the conversion trace. There is a converter called “Create Metadata from Content”. This is the converter that ships with WEX to convert the HTML files into v:xml documents. Each of the links on the left side represents the input and output of a conversion step. We want to click on the output of this converter to see what the document looks like.

[Screenshot: the conversion trace]

You will see the output of your current v:xml document. Note that I have a Google Chrome plugin that formats my XML output for display.

[Screenshot: the converter output]

For the sake of this exercise, let’s change the title field to contain the actual title and the author, like this: Blowout – Lucy Spring. To do this, we go back to the previous page and click “add new converter” further down the page.

[Screenshot: the add new converter link]

We want a custom converter.

[Screenshot: adding a custom converter]

Now you will see the configuration screen for a custom converter. You want to set both the type-in and the type-out to application/vxml-unnormalized: we want to apply this template to application/vxml-unnormalized input and produce application/vxml-unnormalized output. I use “unnormalized” because I want the normal WEX normalization functions to still apply after this transformation. Also give your converter a name.

[Screenshot: the custom converter configuration]

The next section is the conditional setting. This is where you determine which documents the converter will apply to. In this case we want to match everything, so I just add a wildcard (*).

[Screenshot: the converter conditional settings]

You can skip the advanced section and focus on the Action section. First, the action type needs to be set to XSL, since we’re applying an XSL template to an XML document.

[Screenshot: the custom converter Action section]

Now we’ll use a standard template that allows you to copy nodes with special processing.
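A sketch of that standard recipe:

```xml
<xsl:template match="/">
  <xsl:apply-templates select="." mode="copy" />
</xsl:template>

<!-- copy everything verbatim -->
<xsl:template match="node()|@*" mode="copy">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*" mode="copy" />
  </xsl:copy>
</xsl:template>
```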

Run as-is, the template above will simply copy the document unchanged. We want to modify it to merge our title and author by matching on the title content while copying everything else:
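A sketch of the modified version, assuming the author is stored in a sibling content named “author”:

```xml
<xsl:template match="/">
  <xsl:apply-templates select="." mode="copy" />
</xsl:template>

<!-- default: copy everything verbatim -->
<xsl:template match="node()|@*" mode="copy">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*" mode="copy" />
  </xsl:copy>
</xsl:template>

<!-- special case: rebuild the title content -->
<xsl:template match="content[@name='title']" mode="copy">
  <xsl:copy>
    <!-- keep the original attributes -->
    <xsl:copy-of select="@*" />
    <!-- merge the title with the author content from the same document -->
    <xsl:value-of select="concat(., ' - ', ../content[@name='author'])" />
  </xsl:copy>
</xsl:template>
```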

As you can see, I’ve added comments in the code above. The important thing to note is that I want to modify the title content, so I match it; the template stays in mode="copy" because that is the mode the recursive copy operates in. Then I just copy the attributes and concat the two values I wanted.

Save this converter and click test-it again at the top of the Watson Explorer page. You will now see your new converter in the conversion trace.

[Screenshot: the new converter in the conversion trace]

Now if we check the input and output we’ll see the difference.

The before:

[Screenshot: the document before the converter]

Now the title after:

[Screenshot: the title after the converter]

Now if you crawl this collection your titles will include the author name in the search results.
[Screenshot: search results showing the new titles]

The Cognitive Call Center on IBM Watson

One of the major use cases I see for Watson Explorer (WEX) is in call centers. No matter the specific industry, the major goals for call centers are decreasing call time and increasing customer satisfaction rates. The way to do this is to get the correct information in front of the Customer Service Representative (CSR) as fast as possible. This is an excellent use case for Watson Explorer Application Builder (WEX AppBuilder). I’m going to keep this post mostly high level and not get too deep into the technical aspects of such a solution.

[Screenshot: a simple example of a Watson Explorer Application Builder display]

WEX AppBuilder works together with Watson Explorer Engine (WEX Engine) to present the user with a 360-degree view of the information they need. If you think of Cognitive Computing as a pyramid, then the WEX foundational components are the base layer of that pyramid. Watson Explorer Engine can crawl, convert, and index data into high-speed positional indexes. Once the data is indexed in WEX Engine, you can leverage other applications and APIs, both IBM and external, such as WEX AppBuilder or the IBM Watson Developer Cloud services.

For a call center scenario you would first index your data sources in Watson Explorer Engine. WEX Engine connects to all types of data sources, and if a connector doesn’t exist for your data source there are even push APIs available. WEX Engine offers industry-leading search features and can even be used as a standalone enterprise search application.

Once the data is indexed, you can build a compelling user interface very quickly in WEX AppBuilder. The WEX AppBuilder product allows you to connect to Engine data sources and build pages and widgets to display the data. There are several out-of-the-box widgets, but it also allows for custom widgets using Ruby, JavaScript, HTML, and CSS. I don’t want to give the idea that WEX AppBuilder is simply a display framework, though. The product allows you to define entities and associations from your data, and those associations can be used to bring related data together into that unified 360-degree view. AppBuilder also has the concept of endpoints, which let you connect to APIs in real time to bring in additional data and let other systems connect to AppBuilder to retrieve data.

Say your call center is caller-centric. (That seems obvious, but it is not always the case; a call may involve many products.) Your CSR gets the critical information from the call, performs a search, and lands on a page that tells them all about the caller: all of the products they use and don’t use, whether they are past due, and whether they had support tickets open recently. Today, most call center reps need to navigate through several different systems to get this data. You can see rather quickly how gathering all of this information into a single view can be beneficial.

[Screenshot: an example IBM Watson Personality Insights profile]

Getting the call center onto a 360-degree view is the first step. The real power comes from having that data in the IBM Watson Explorer platform; you can then start taking the system in a cognitive direction. Instead of simply displaying the data to the rep, what if the system could predict what the call is about? Using WEX Content Analytics and the Watson Developer Cloud you can start to put together those predictions. You can analyze data points such as mailers that were sent out, past-due bills, reasons for calls, etc. You could analyze the call logs and help tickets to get a general sentiment for the client. Would you like to be able to predict that a client is about to change providers? Why wait for them to tell you about it? You could even use Watson Personality Insights to build a profile of the client, which can suggest what types of communication turn them off.

When your CSRs are armed with this type of information, it can really change the interaction. Imagine how you would feel if you called in and the CSR already knew about your issue. What if you were notified of the issue before you even thought to call about it? The system can evolve even further: you could use something like the Watson Conversation Service to handle some of these questions before the client even has to speak to a person, freeing up your CSRs to handle higher-value calls. When people do call, the IVR system can pass the information to Watson Explorer and have the display populated for the CSR.

As you can see from this high-level view, there is a lot of value to be gained from starting with the Watson Explorer platform. Getting your call center onto it is the first step in this cognitive journey.

If you’d like to learn more feel free to comment below or contact me.

 

IBM Watson Explorer

I’m going to talk a little bit about IBM Watson Explorer (WEX). A few people have contacted me about what I do at my day job as a Watson Explorer Consultant. Since this is my personal site I don’t usually focus on what I do at my 9-5. I’m going to write a few posts to explain what exactly it’s like working for the IBM Watson Group and what applications I work with.

What is IBM Watson Explorer?

IBM Watson Explorer is a data discovery tool. It allows you to explore vast amounts of enterprise data. The tool consumes and indexes data from various data sources, and out of the box it ships with many popular connectors for enterprise data systems. Using its own proprietary indexing technology, Watson Explorer can leverage natural language processing to deliver relevant query results to end users. The product can also use Query Routing to route queries to websites and return the results within its own interface. This data can be integrated into a single, 360-degree-view application on the front end.

IBM Watson Explorer ships with several different modules:

  • IBM Watson Explorer Foundational Components
    • IBM Watson Explorer Engine
    • IBM Watson Explorer Results Module
    • IBM Watson Explorer Application Builder
  • IBM Watson Explorer Analytical Components
    • IBM Watson Explorer Content Analytics Admin Console
    • IBM Watson Explorer Content Analytics Miner
    • IBM Watson Explorer Content Analytics Search
    • IBM Watson Explorer Content Analytics Studio

IBM Watson Explorer Engine

The Watson Explorer Engine component is the key backend piece of the foundational components, which come from IBM’s acquisition of a Pittsburgh, PA startup called Vivisimo. Engine acts as an enterprise search engine that can crawl and index large amounts of both structured and unstructured data. The documents are stored as XML documents, and during the crawling process XSLT can be used to modify a document before it is stored in the index. Engine can be distributed among many servers to meet big data needs and scale quickly. The web-based admin interface gives IT users a simple way to manage this powerful application. For enterprise search applications, Engine comes with its own search interface; to leverage 360-degree views, Engine must be combined with IBM Watson Explorer Application Builder.

[Screenshot: the Watson Explorer Engine admin interface]

IBM Watson Explorer Results Module

The Results Module component allows non-technical business users to manage features of the search results within Watson Explorer. Users can use the spotlight manager to configure spotlights that show boosted content above standard search results for specific keywords. You can also use Results Module’s terminology manager to easily manage spelling suggestions, synonyms, and related terms.

[Screenshot: the IBM Watson Explorer Results Module interface]

IBM Watson Explorer Application Builder

The Application Builder module is used to build 360-degree views of enterprise data. This application connects to Watson Explorer Engine and displays the indexed data to the end user. One of the primary benefits of Application Builder is its entity model: by creating entities for your data you can define relationships, which allows a developer to easily combine and display related data to the end user. Users can choose to follow specific entities they are interested in, and Application Builder will then provide them with the most relevant information based on what they follow.

The technology behind Application Builder is Ruby on Rails. Specifically, App Builder uses JRuby, which means the application runs inside a JVM, so for IT purposes it can be treated as just another Java application.

App Builder ships with several out-of-the-box widgets that can be configured to display various types of data, but most deployments use many custom widgets. Custom widgets let a developer use Ruby (ERB), HTML, CSS, and JavaScript to deliver a custom experience. This tool is very powerful and is currently used by many enterprise customers to get a full 360-degree view of their data so they can make educated business decisions.

[Screenshot: an example “360 View” using IBM Watson Explorer Application Builder]

IBM Watson Content Analytics

The Content Analytics module is a separate piece of software from the foundational components. IBM Watson Content Analytics allows you to collect and analyze different types of content. It stores this content in its own indices, which are currently separate from Watson Explorer Engine’s. It can consume both structured and unstructured data from documents, email, databases, websites, and other enterprise repositories. You can then perform text analytics across the data indexed by Watson Content Analytics.

What is a Watson Explorer Consultant?

I’m a Watson Explorer Consultant. That means I work directly with customers to conceptualize and deliver Watson Explorer solutions. My primary focus is on the Watson Explorer foundational components: I use the Engine, App Builder, and Results Module components to deliver solutions to data problems at companies big and small. I’m currently one of my team’s experts on IBM Watson Explorer Application Builder, and I’m able to use my past web development experience to deliver highly customized solutions to customer data problems.

Our team is based out of Pittsburgh, PA, but we are also distributed across the world. I currently work from my home in Ohio full time, and I spend a portion of my time traveling to client sites to consult with customers directly and deliver solutions in person. It takes a special kind of person to be able to handle problems with both technology and humans. If you’d like to reach out to me, please use my contact form.