Wednesday, February 15, 2006

Natural Language Searching

How Natural Language Search Engines Work (based on my articles on Westlaw’s WIN; aging but basically still sound)
By Betsy McKenzie
1. Make a Question into a Query by filtering out introductory clauses (“how can I find” or “what is...”), leaving the concept terms for the search. There are now many other stop lists as well, all designed to allow the search system to sort out human chat and select concept terms for a query. The system also stores a list of commonly misspelled words (correcting “judgement” to “judgment,” for example) and a list of common phrases (like “underground storage tank” or “Saturday night special”). This does not yet amount to artificial intelligence, but it is an increasingly sophisticated tool for translating human speech patterns into a workable query.
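The filtering in step 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not Westlaw's code: the lists below (INTRO_PHRASES, STOP_WORDS, SPELLING_FIXES, KNOWN_PHRASES) and the function name question_to_query are made up for the example, and the real lists are certainly far longer.

```python
import re

# Hypothetical lists for illustration only; the real WIN lists are not reproduced here.
INTRO_PHRASES = ["how can i find", "what is", "i am looking for"]
STOP_WORDS = {"a", "an", "the", "on", "for", "about", "cases"}
SPELLING_FIXES = {"judgement": "judgment"}
KNOWN_PHRASES = ["underground storage tank", "saturday night special"]


def question_to_query(question):
    """Reduce a plain-English question to a list of concept terms."""
    # Lowercase and replace anything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", question.lower())
    text = " ".join(text.split())
    # Drop a leading introductory clause, if one is present.
    for intro in INTRO_PHRASES:
        if text.startswith(intro + " "):
            text = text[len(intro):].strip()
            break
    # Correct commonly misspelled words.
    text = " ".join(SPELLING_FIXES.get(w, w) for w in text.split())
    # Protect stored phrases by joining their words with underscores.
    for phrase in KNOWN_PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    # Remove the remaining chat words and restore the phrases.
    return [w.replace("_", " ") for w in text.split() if w not in STOP_WORDS]


print(question_to_query(
    "How can I find cases about an underground storage tank judgement?"
))
# prints ['underground storage tank', 'judgment']
```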

2. Beyond Wild Cards: The Natural Language systems do what the Terms and Connectors searches do, automatically creating a plural form of any noun in your query. But they go far beyond this step. They will create a singular form if you enter a plural. And they will stem verbs in a most sophisticated way: “design” will generate [designing, designs, designed] but not “designate,” which is pretty darned sophisticated. The program will also work with irregular verbs, so that “sell” will generate “sold.”
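To show why this differs from a simple wild card, here is a deliberately naive Python sketch. It generates a fixed set of inflected forms instead of matching everything that begins with the same letters, so “designate” never enters the picture. The IRREGULAR table and the variants function are assumptions made for the example; the suffix rules handle only the simplest cases (they would mangle a word like “use”), and a real system would be much more careful.

```python
# Hypothetical irregular-form table; a real system would store far more entries.
IRREGULAR = {
    "sell": ["sells", "selling", "sold"],
    "buy": ["buys", "buying", "bought"],
}


def variants(term):
    """Return the term plus a small, fixed set of inflected forms."""
    term = term.lower()
    if term in IRREGULAR:
        return [term] + IRREGULAR[term]
    # If a plural was entered, recover the singular first (naive rule).
    if term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return [term, term + "s", term + "ing", term + "ed"]


print(variants("design"))  # prints ['design', 'designs', 'designing', 'designed']
print(variants("sell"))    # prints ['sell', 'sells', 'selling', 'sold']
```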

3. Specialized Features: Westlaw’s WIN recognizes West Key numbers. I am sure both Lexis and Westlaw recognize legal citations in whole and in part. Westlaw tells me that they give partial credit for a partial citation or phrase.

Westlaw for many years had stop lists that would refuse to search common words such as “of” and “at.” This meant you could not search “at will” or “statute of limitations.” Now they have made a rule allowing you to search such common stop words if they are the first or last word in a phrase, so you can now search both. The trick for “statute of limitations” or other phrases with a stop word inside is to break the phrase so the stop word is at the end or beginning: Statute +1 “of limitations.” This will work in Terms & Connectors as well as Natural Language in Westlaw. Searching these words could still cause problems in calculating relevancy if the program counted every occurrence of the stop word “at” or “of” anywhere in the document. Westlaw’s WIN program was therefore modified to count these stop words only when they occur within the designated legal phrase.
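The relevancy problem is easy to see in a small Python sketch. The count_phrase function below is a made-up helper, not anything from Westlaw: it counts a stop word only when it appears as part of the whole phrase, and the example compares that with a raw count of the stop word.

```python
def count_phrase(document, phrase):
    """Count occurrences of the whole phrase, word for word, in the document."""
    words = document.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)


doc = "He was employed at will and met us at the office at noon"
print(doc.lower().split().count("at"))  # prints 3 (every "at" in the document)
print(count_phrase(doc, "at will"))     # prints 1 (only "at" inside the phrase)
```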


The Thesaurus feature has to be selected in Westlaw by the user. That means it is underused. It is a great feature, and you should look at it and decide how useful it is for each search. At the least, it can give you ideas for search terms. (Actually, this was written a while ago, and I think they are pushing the Thesaurus more to the forefront. It is a great tool, and you should look for ways to use it.)


WIN functions by scoring words for uniqueness based on frequency of appearance in the whole Westlaw database. An infrequent term scores higher, presumptively being more specific or unique. Thus a clever researcher might strive for a query including lesser used terms, and then use the thesaurus function to broaden the query. But the success of a query depends, in Natural Language searching as in Boolean, upon the researcher being able to guess the terms that will appear in the ideal document.
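One common way to express “rarer terms score higher” is inverse document frequency, a standard information-retrieval weighting. I do not know WIN’s actual formula, so take the Python below as a sketch of the general idea only; the rarity_weight function and the three sample documents are invented for the example.

```python
import math


def rarity_weight(term, documents):
    """Weight a term higher the fewer documents contain it (a standard IDF form)."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return 0.0  # The term does not occur in the database at all.
    return math.log(len(documents) / containing) + 1.0


docs = [
    "the tenant sued the landlord",
    "the landlord evicted the tenant",
    "the estoppel claim failed",
]
for term in ("the", "tenant", "estoppel"):
    print(term, round(rarity_weight(term, docs), 3))
```

In this toy database “the” appears in every document and gets the lowest weight, while “estoppel” appears in only one and gets the highest.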


Once the query is formed, the WIN system searches the concordance, the list of all words occurring in the selected database, just as the Terms and Connectors system does. But while a Boolean search looks for every occurrence of the specified terms in the specified relationship, the Natural Language search performs a statistical analysis of the words in each document. This analysis in WIN is based on:

1) Frequency of each term in the database as a whole -- the more unusual term will score higher.
2) Weight assigned to each term based on frequency, proximity, and length of the document (a shorter document with the same terms has a more concentrated focus, and scores higher);
3) Score of the document itself;
4) Score of portions of each document. This looks for clusters of the query terms.
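The first three factors above can be sketched roughly in Python. This is an assumption-laden simplification, not WIN’s method: the document_score function and the sample weights are invented, proximity is left out entirely, and the length adjustment is a plain division by word count.

```python
def document_score(query_weights, document):
    """Sum the weighted term counts, then divide by document length."""
    words = document.lower().split()
    if not words:
        return 0.0
    total = sum(weight * words.count(term) for term, weight in query_weights.items())
    return total / len(words)


weights = {"estoppel": 2.0, "landlord": 1.0}  # hypothetical rarity weights
short_doc = "landlord estoppel claim"
long_doc = "landlord estoppel claim " + "filler " * 7

print(document_score(weights, short_doc))  # prints 1.0
print(document_score(weights, long_doc))   # prints 0.3
```

Both documents contain the same query terms, but the shorter one scores higher because the terms are more concentrated.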

Clusters are defined as multiple occurrences of terms surrounded by a “desert” of forty or more words with no terms. This is unique to Westlaw’s Natural Language search system. It allows a more sophisticated search that will find and score high relevance for the document that has one paragraph out of twenty that is exactly on point for my search. Without this portion score, such a document, which is probably of great interest to me, would score much lower in relevance and might not appear or be looked at. The program blends the portion score with the document’s overall score. This also avoids accidentally returning as very relevant those documents that discuss half the terms in an early part of the document and half in an unrelated later portion.
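The cluster idea can be sketched in Python as follows. The forty-word “desert” comes from the description above; everything else is an assumption for illustration, including the function names find_clusters and blended_score and the even 50/50 blend, since I do not know how WIN actually weights the portion score against the document score.

```python
def find_clusters(words, query_terms, desert=40):
    """Group positions of query terms; a run of `desert` or more
    term-free words between two hits starts a new cluster."""
    positions = [i for i, w in enumerate(words) if w in query_terms]
    clusters = []
    for pos in positions:
        # pos - previous - 1 is the number of term-free words in between.
        if clusters and pos - clusters[-1][-1] - 1 < desert:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    # A cluster needs multiple occurrences of terms.
    return [c for c in clusters if len(c) > 1]


def blended_score(whole_score, portion_score, portion_share=0.5):
    """Blend the whole-document score with the best portion score."""
    return (1 - portion_share) * whole_score + portion_share * portion_score


words = ["foul", "ball"] + ["x"] * 40 + ["foul"] + ["y"] * 5 + ["ball", "liability"]
print(find_clusters(words, {"foul", "ball", "liability"}))
# prints [[0, 1], [42, 48, 49]]
```

The forty filler words split the hits into two clusters, while the five-word gap inside the second cluster does not.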

Getting the Most From WIN
WIN does not require all sets of terms in the query in order to return results. It will only return “no matching documents” if there are absolutely no documents in the database with any of the terms, or stems, or searcher-selected synonyms of terms. To compare: a Terms and Connectors search, <(squirtle charmeleon) & pokemon /p copyright>, returns no documents. Running the same search in Natural Language, <squirtle or charmeleon in a case on pokemon copyright>, does give the warning that the first three terms do not exist in the database (yet!), but will still return the maximum number of cases that contain the word “copyright” alone.
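The difference between the two modes can be sketched in Python. Both functions are invented for the example and are much simpler than either real search mode: boolean_and just requires every term (ignoring the OR and /p connectors in the query above), and ranked_any ranks by a bare count of matching terms instead of a real relevance score. The limit of 20 mirrors the default number of documents mentioned below.

```python
def boolean_and(documents, terms):
    """Return only documents that contain every term."""
    return [d for d in documents if all(t in d.lower().split() for t in terms)]


def ranked_any(documents, terms, limit=20):
    """Return documents containing at least one term, most matching terms first."""
    scored = []
    for d in documents:
        words = set(d.lower().split())
        hits = sum(1 for t in terms if t in words)
        if hits:
            scored.append((hits, d))
    # The sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:limit]]


docs = ["copyright in software", "trademark dispute", "copyright and fair use"]
terms = ["squirtle", "charmeleon", "pokemon", "copyright"]
print(boolean_and(docs, terms))  # prints []
print(ranked_any(docs, terms))   # prints ['copyright in software', 'copyright and fair use']
```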

WIN allows the user to retrieve up to one hundred documents. This number can be changed from the twenty-document default by using the password options menu, or by changing the number on the query screen. The user can also select whether to sort retrieved documents by relevance or by chronological order, by using the browse menu. The user can still display the relevance ranking even after selecting the age rank display. This can allow a clever searcher to use a combination of recency and relevance to select which documents to look at first. It is suggested that a searcher look at the first few documents retrieved in term mode, then switch to best selection under the display mode. This keeps a single heavy occurrence of terms in one section from skewing the entire relevance list.

As a user, I can select “control concepts” to require the occurrence of a term or its equivalents. I can also use this to exclude documents where a particular term occurs. Exclusion is a powerful and potentially dangerous tool, but it can also be important. It might drop a document you want if the excluded term appears somewhere in the document other than the discussion on which you are focusing. It is recommended to use control concepts only after a first search has been unsuccessful -- either too large or unfocused.
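Here is a small Python sketch of how required and excluded terms might filter a result list, including the danger of exclusion. The apply_control_concepts function and its parameters are made up for the example and are not Westlaw’s implementation; each required group stands for a term and its equivalents.

```python
def apply_control_concepts(documents, required_groups=(), excluded_terms=()):
    """Keep documents that have at least one word from every required group
    and none of the excluded terms."""
    kept = []
    for doc in documents:
        words = set(doc.lower().split())
        has_required = all(words & set(group) for group in required_groups)
        has_excluded = any(term in words for term in excluded_terms)
        if has_required and not has_excluded:
            kept.append(doc)
    return kept


docs = [
    "car accident on a highway",
    "automobile insurance claim",
    "bicycle accident near a parked car",
    "bicycle race",
]
print(apply_control_concepts(docs, [{"car", "automobile"}], ["bicycle"]))
# prints ['car accident on a highway', 'automobile insurance claim']
```

The third document mentions a car and might have been worth reading, but excluding “bicycle” drops it anyway.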

There are a few other tricks to help focus more tightly once search results come in. Even with a WIN search, I can still use the “locate” command to perform a terms and connectors search within my original search results. I can also change the maximum results that will be returned, and add restrictions on the search at the query screen to help ensure the search returns the kind of documents I want. In WIN, I can still restrict the search by date or jurisdiction, just as I can in Terms and Connectors mode. This is a very powerful sorting tool.

To create the best Natural Language search, the searcher should try to succinctly state the issue in terms that will most likely appear in the documents desired. As with Terms and Connectors searches, results depend on the searcher’s ability to guess the language the court or other producer will use in the document of her dreams. The more specialized the area, the more terms of art and unique language can be used, the more likely the searcher will get the results she wants. A little background reading or familiarity with the area can help with WIN searches just as with our old Boolean searches.

SO, WHEN IS NATURAL LANGUAGE BETTER THAN BOOLEAN?

When You Don’t Need More than A Start

How often do we just need to find one case, article or statute on point, and then other tools such as citators will fill out the research as far as we really need it? My experience in the practice of law was that, by finding even one case on point, I had enough to spin out, through Shepard’s or KeyCite, and careful reading of the case’s cites, to complete the research adequately for the narrow uses of legal practice. Unless one is doing a major appeal in a big-money case, lawyers don’t try to be comprehensive in the sense that law school professors, law reviews and librarians do. As Michael Lynch points out, the facts of a client’s case generally demand fairly narrow research, both in factual and in jurisdictional terms.1


When Researching a Concept, Not Looking for a Particular Document

Natural Language searches may be an advantage when getting an introductory overview of an area, or looking at a broad concept rather than a specific topic.2 Academic representatives have suggested a Natural Language search for an initial search on moot court problems, to get an overview. It might be helpful in a law review or texts database to search for background in an area before composing the more focused search for primary law.

This, of course, depends on the access at a firm or library. Law students and increasing numbers of law firms have flat rate contracts, and if such a database is included in the flat rate, this might be a useable plan for firm research as well as law school research (though the flat rate pricing may rise with the increased use). This may also apply when the researcher is so new to a field of law that she doesn’t know the language, or the issue is complex, and one can’t guess what terms are needed.3


When There Are Too Many Results in Terms & Connectors

Because WIN allows for search results to be organized in relevance order, it may be preferable when Terms and Connectors searches retrieve too many results. Of course, the researcher should not overlook other control techniques, such as field searching with synopsis/digest or date fields, selecting a narrower database, or creating a narrower Boolean search. But if all these techniques still return too many unfocused or marginally relevant items, try a Natural Language search. Because Terms and Connectors will find documents where the terms are mere dicta or footnotes, the relevancy analysis of WIN may solve the problem. It can often pull the most relevant needles out of the haystack.4 Then, having found one case or article really on point, the searcher can broaden the research again with topic & key number and KeyCite or Shepard’s.


Searcher Has Trouble With or Is Unsure About Connectors

If a Terms and Connectors search is unsuccessful, or the searcher is unsure how the terms might relate to each other in the documents desired, try a Natural Language search. WIN can also be used to supplement a Boolean search to be extra thorough.5 Some years ago, I was advised by a Westlaw Reference Attorney that WIN was able to pull up a handful of relevant cases where the issue was liability for a baseball fan hit by a foul ball. The Reference Attorney told me he had tried many different Terms and Connectors searches unsuccessfully before turning to the WIN search that worked. I tried the searches myself, and had the same sort of experience. The problem seemed to be that connectors were either too narrow or too broad. Judges writing on this issue used too many configurations of the terms, and a Boolean search just was not successful for this topic.


Searching Law Reviews or Long Documents Without Fields


In case law databases, the synopsis/digest field search and Westlaw key number searches narrow the search. But in a law review database, the articles are very lengthy, and have no synopsis or digest field, and no key numbers. A WIN search can somewhat duplicate those narrow focus tools in such a database.6 By scoring documents by the number of times search terms appear in clusters, WIN looks for concentrations of the query terms, and then sorts the results for relevancy. WIN then lists the most statistically relevant documents first.

1 Lynch, An Impossible Task But Everybody Has to Do It - Teaching Legal Research in Law Schools, 89 Law Libr. J. 415, at 419-420, Summer, 1997.
2 Conversation with Bill Bemish, West Group, Sept. 14, 1999 about WIN. See also, “Weekly Research Tip” column, visited 2/1/00.
3 “Weekly Research Tip” column, visited 2/1/00.
4 Telephone conversation with Bill Bemish, 9/14/99.
5 “Weekly Research Tip” column, visited 2/1/00.
6 Telephone conversation with Bill Bemish, 9/14/99.

1 comment:

Betsy McKenzie said...

By the way, our academic rep from Westlaw believes that natural language searching works better for statutes & regs because they may be one long, run-on sentence, with multiple clauses, making it difficult to control for proximity with a terms & connectors search. I personally think statutes & regs are very hard to search online, with or without natural language. Partly because of the insane clauses & sentence structure, and partly because statutes use strange vocabulary that can be difficult to guess.