Monday, October 13, 2008

Use Regular Expressions in SQL

A feature introduced in Oracle Database 10g vastly improves your ability to search and manipulate character data. This feature, regular expressions, is a notation for describing textual patterns. It has long been available in many programming languages and a number of UNIX utilities and now available in Oracle. Oracle's implementation of regular expressions comes in the form of various SQL functions and a WHERE clause operator.

What Is a Regular Expression?

A regular expression comprises one or more character literals and/or metacharacters. In its simplest format, a regular expression can consist only of character literals, such as the regular expression cat. It is read as the letter c followed by the letters a and t and this pattern matches strings such as cat. When you understand the meaning of the various metacharacters, you will see that regular expressions are powerful for isolating and replacing specific textual data. You can apply them in order to validate the formats of phone numbers, zip codes, email addresses, Social Security numbers, IP addresses, filenames and pathnames, and so on. Furthermore, you can locate patterns such as HTML tags, numbers, dates, or anything that fits any pattern within any textual data and replace them with other patterns.

To learn Regular Expression you can go here


Using Regular Expressions With Oracle Database 10g

To utilize the power of regular expressions, you can exploit the newly introduced Oracle SQL REGEXP_LIKE operator and the REGEXP_INSTR, REGEXP_SUBSTR, and REGEXP_REPLACE functions. You will see how this new functionality supplements the existing LIKE operator and the INSTR, SUBSTR, and REPLACE functions. In fact, they are similar to the existing operator and functions but now offer powerful pattern-matching capabilities. The searched data can be simple strings or large volumes of text stored in the database character columns. Regular expressions let you search, replace, and validate data in ways you have never thought of before, with a high degree of flexibility.


The REGEXP_LIKE Operator

The following SQL query's WHERE clause shows the REGEXP_LIKE operator, which searches the ZIP column for a pattern that satisfies the regular expression [^[:digit:]]. It will retrieve those rows in the ZIPCODE table for which the ZIP column values contain any character that is not a numeric digit.

SQL> SELECT zipFROM zipcode WHERE REGEXP_LIKE(zip, '[^[:digit:]]');

ZIP

-----
ab123
123xy
007ab
abcxy


This example of a regular expression consists only of metacharacter more specifically the POSIX character class digit delimited by colons and square brackets. The second set of brackets (as in [^[:digit:]]) encloses a character class list. As previously mentioned, this is required because you can use POSIX character classes only for constructing a character list.

For More Understanding go Here


The REGEXP_INSTR Function

This function returns the starting position of a pattern, so it works much like the familiar INSTR function. The main difference between the two functions is that REGEXP_INSTR lets you specify a pattern instead of a specific search string; thus providing greater versatility. The next example uses REGEXP_INSTR to return the starting position of the five-digit zip-code pattern within the string Joe Smith, 10045 Berry Lane, San Joseph, CA 91234. If the regular expression is written as [[:digit:]]{5}, you will get the house number's starting position instead of the zip code's, because 10045 is the first occurrence of five consecutive digits. Therefore, you must anchor the expression to the end of the line, as indicated with the $ metacharacter, and the function will display the starting position of the zip code regardless of the number of digits for the house number.


SQL> SELECT REGEXP_INSTR('Joe Smith, 10045 Berry Lane, San Joseph, CA 91234','[[:digit:]]{5}$') AS rx_instr FROM dual;

RX_INSTR

----------
45

Writing More Complex Patterns Let's expand on the zip code pattern of the previous example to include an optional four digits. Your pattern may now look like this: [[:digit:]]{5}(-[[:digit:]]{4})?$. If your source string ends in either the 5-digit zip code or the 5-digit + 4 zip-code format, you'll be able to show the pattern's starting position.

SQL> SELECT REGEXP_INSTR('Joe Smith, 10045 Berry Lane, San Joseph, CA 91234-1234', ' [[:digit:]]{5}(-[[:digit:]]{4})?$') AS starts_at FROM dual;

STARTS_AT

----------
44

In this example the parenthesized subexpression (-[[:digit:]]{4}) is repeated zero or one times, as indicated by the ? repetition operator. Again, attempting to use the traditional SQL functions to accomplish the same result poses a challenge even to SQL experts. To better explain the various components of the regular expression please go Here


The REGEXP_SUBSTRFunction


The REGEXP_SUBSTR function, much like the SUBSTR function, extracts part of a string. In the following example, the string that matches the pattern , [^,]*, is returned. The regular expression searches for a comma followed by a space; then zero or more characters that are not commas, as indicated by [^,]*; and lastly looks for another comma. The pattern will look somewhat similar to a comma-separated values string.

SQL> SELECT REGEXP_SUBSTR('first field, second field , third field',', [^,]*,') FROM dual;

REGEXP_SUBSTR('FIR
------------------
, second field ,


The REGEXP_REPLACE Function

Let's first look at the traditional REPLACE SQL function, which substitutes one string with another. Assume your data has extraneous spaces in the text and you would like to replace them with a single space. With the REPLACE function, you need to list exactly how many spaces you want to replace. However, the number of extra spaces may not be the same everywhere in the text. The next example has three spaces between Joe and Smith. The REPLACE function's parameter specifies that two spaces should be replaced with one space. In this case, the result leaves an extra space where there were three spaces in the original string between Joe and Smith.

SQL> SELECT REPLACE('Joe Smith',' ', ' ')AS replace FROM dual;

REPLACE

---------
Joe Smith

The REGEXP_REPLACE function takes the substitution a step further. The following query replaces any two or more spaces with a single space. The ( ) subexpression contains a single space, which can be repeated two or more times, as indicated by {2,}.

SQL> SELECT REGEXP_REPLACE('Joe Smith','( ){2,}', ' ') AS RX_REPLACE FROM dual;

RX_REPLACE
----------
Joe Smith


For More Understanding go Here

No comments: