Here I will demonstrate how to index content of CLOB data using Oracle Text. Other than the database objects used in the example, the post addresses the use of Oracle Text in a general way, and will work for any CLOB column. There is no ground breaking code here, but it completes the example on how to handle Microsoft Open Office XML Format (DOCX) documents. The examples and content here are based on
my previous post.
The code here is tested with Oracle database release 11.1.0.6, but should work with releases down to 10g. When I say "tested", remember I am a developer, so it basically means it does compile and it did give the expected result on
one run trough...
See it in ActionI have included a search page in my demo application. Upload a DOCX document, and try it for yourself:
Go to demo application.
About Oracle TextOracle Text (formerly Oracle interMedia) is Oracles full-text retrieval technology, and part of Standard, Enterprise and even XE editions of the database. See
Oracle Text on oracle.com for more information.
In my case I will use it to index text content stored in a CLOB column, and utilize it's search capabilities to enable users to retrieve relevant results based on search phrases.
Indexing CLOBsOr BLOBs, or BFILEs, for that matter. It is very easy to get up and running with a text index. The following code is all that is needed to get a text index up and running against the TEXT_CONTENT column of the SAMPLES_DOCX table:
create index samples_docx_ctx on samples_docx
(text_content)
indextype is ctxsys.context
/
That's it! If you want to get fancy, there are a number of options and ways to enhance the indexing, look at
the documentation for more information on the subject (lexers, sections, stopword lists, etc.).
Searching with Oracle TextWith the text index in place, the next step is to use the index to retrieve search results. In my example I will use a technique called "Progressive Relaxation" as described by Roger Ford
here. It uses a query template to implement a progressive relaxation of the search tokens.
The following script creates a package to handle and use the new text index:
create or replace package samples_docx_search_p
as
-- result types
type t_result_rec is record (
score number,
filename samples_docx.filename%type,
snippet varchar2(4000));
type t_result_tab is table of t_result_rec;
-- synchronize search index
procedure sync_search_index (
p_ctx_index_name in varchar2 default 'SAMPLES_DOCX_CTX'
);
-- search docx content
function search_text (
p_tokens in varchar2
) return t_result_tab pipelined;
end;
/
create or replace package body samples_docx_search_p
as
-- synchronize search index
procedure sync_search_index (
p_ctx_index_name in varchar2 default 'SAMPLES_DOCX_CTX'
) is
begin
ctx_ddl.sync_index(p_ctx_index_name, '2M');
end;
-- search docx content
function search_text (
p_tokens in varchar2
) return t_result_tab pipelined
as
l_max_rows integer := 10;
l_counter integer := 0;
l_ret_rec t_result_rec;
l_tokens varchar2(4000) := lower(p_tokens);
l_query_template varchar2(32000):='<query>'
||chr(10)||' <textquery> heregoesthetokens'
||chr(10)||' <progression>'
||chr(10)||' <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>'
||chr(10)||' <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>/seq>'
||chr(10)||' <seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite></seq>'
||chr(10)||' <seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>'
||chr(10)||' <seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>'
||chr(10)||' </progression>'
||chr(10)||' </textquery>'
||chr(10)||'</query>';
l_query varchar2(32000);
begin
if l_tokens is null
then
return;
end if;
-- create query temlpate based on tokens
l_query_template := replace(l_query_template, 'heregoesthetokens', l_tokens);
-- restructure tokens for use with snippet
l_tokens := replace(l_tokens, ' ', ' | ');
for c in (select rowid
, filename
from samples_docx
where contains (text_content, l_query_template, 1) > 0)
loop
-- crude custom score, just listing the results as they come
l_ret_rec.score := l_counter;
l_ret_rec.filename := c.filename;
-- create snippet (with tokenbased highlight) of content to return to the user
ctx_doc.set_key_type('ROWID');
l_ret_rec.snippet := ctx_doc.snippet('SAMPLES_DOCX_CTX', c.rowid, l_tokens, '<b>', '</b>', false);
l_counter := l_counter + 1;
-- return row
pipe row (l_ret_rec);
-- exit when max number of rows exceeded
exit when l_counter >= l_max_rows;
end loop;
end;
end;
/
Some parts of the code are more cryptic than others.
First up is the "SYNC_SEARCH_INDEX" procedure. It's sole purpose is to synchronize the search index with the TEXT_CONTENT column. When you perform DML on the TEXT_CONTENT column, changes to the text index does not automatically propagate to the search index (unless the index is explicitly told to do so). So, in this example, if you insert a new row, the TEXT_CONTENT will not show up through Oracle Text searches until you have told it to synchronize the index. Indexing can be a resource demanding operation, so it makes sense to separate table DML from the indexing job.
In short:
Make sure you synchronize the text index after DML on the indexed column has been performed.
Next is the "SEARCH_TEXT" function, particularly the query template. In the past I have used Oracle Text in its most simple form, but query templates adds a new dimension to controlling your searches. Basically you define how you want to use the search tokens (search phrase) to find matches in the text index. It is written in XML-form, and the <seq>-tag states the order of which you want to process your criteria. For more information on query templates, take a look at
the documentation.
A short explanation on the query template:
- <rewrite>transform((TOKENS, "{", "}", " "))</rewrite>: Matches when whole phrase (tokens in order) is present within document
- <rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>: Matches when each token is present within document
- <rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite>: Matches when each token is present within docoument, allowing typos (fussy search)
- <rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document
- <rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document, allowing typos (fussy search)
Although it will return matches according to the sequence, the "score" within the sequence will be random (as far as I could see at least), so the "score" column in the example above needs a bit of work. I also learned the hard way that structure of the XML in the query template is not validated, and it swallows typos without a word.
Test the CodeYou can test the code directly from SQL*Plus or your favorite weapon of choice. The following code will search for "test":
select filename
, snippet
from table(samples_docx_search_p.search_text('test'))
order by score
/
WildcardsMy query template does not support searching for parts of words, there are some reasons for this behavior.
- Wildcard searches requires more resources
- The results returned will be unpredictable. Imagine searching for "ora*", both "oracle" and "orange" will match.
- Wildcard searches are more suitable for look-ups in structured data
What did I mean by that last statement? As I see it, users today are using a wide range of search engines and site searches every day, and wildcard matching are in most cases available only through the "advanced search"-link (that is rarely used, if ever). Let alone that people do not expect this kind of behavior anymore. If you want to help the user, add ajax autocomplete to your search field, and use the Oracle Text index. In the traditional back office application, it is another matter (look-ups and such).
Create a Search Page in Oracle APEXI will not do a detailed explanation here, but creating a report based on a query will do the trick. If you choose "Yes" for "Enable search" in the wizard it will save you some work by creating the search dialog on the page. Clean up the query in the report region so that tour query looks something like this:
select filename
, snippet
from table(samples_docx_search_p.search_text(:PXX_REPORT_SEARCH))
order by score
When keeping the order by, I assume that you will let the score column determine the result order. For good measure you can create a row template for the result representation in the APEX report so you can have heading and snippet spread over two rows.
Update Sample Upload PageIf you created the upload page described in my previous post, include a new page process (After Submit, and after the "extract_text"-process). The process should be of PL/SQL-type, and the code should look something like this:
samples_docx_search_p.sync_search_index;
It does not necessary have to be a page process, but it is important to synchronize the text index after DML.
How to Handle DOC and DOCX...And PDF, XLS, PPT, etc. Well, the DOCX is the exception here. Oracle Text does not understand the format yet (unless you are running on release 11.1.0.7 or higher). If you create an Oracle Text index on a BLOB column containing other file types which it does understand, they will be indexed automagically. If you want to index the content of a DOCX document in the same index, use the technique described in my last post, convert the output (plain text) into a BLOB, and insert the BLOB into the table with the other documents. Or said in a different way: you except and handle DOCX documents differently than the known document types.
Querying an Oracle Text index works the same for BLOB and CLOB columns.
There is definitely room for improvement in the code, but it serves well enough for demonstration purposes.
Enjoy :-)