|
What makes many regular expressions hard to write is that quantifiers are greedy by default. Making them ungreedy with the U modifier lets you write something like the following.
Example:
Get the protocol, URL, and parameters from the string below.
ftp://www.google.fr?toto=45&ret=fr 23 lmjjj jj hgh ffg
#(http|https|ftp)(?:://)(.*\.[a-zA-Z0-9]{2,4})([s/?].*)\s#U
U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
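The example above can be tried out with preg_match(). This is a minimal sketch, not a general URL parser: the variable names are illustrative, and the character class is written as [a-zA-Z0-9] on the assumption that A-z in the pattern is a typo.

```php
<?php
// Subject string taken from the example above.
$subject = 'ftp://www.google.fr?toto=45&ret=fr 23 lmjjj jj hgh ffg';

// With the U modifier every quantifier is lazy, so each .* stops as early
// as the rest of the pattern allows.
$pattern = '#(http|https|ftp)(?:://)(.*\.[a-zA-Z0-9]{2,4})([s/?].*)\s#U';

if (preg_match($pattern, $subject, $m) === 1) {
    // $m[1] = protocol, $m[2] = host, $m[3] = query string
    echo $m[1], ' | ', $m[2], ' | ', $m[3], "\n";
}
```

The (?:://) group is non-capturing, which is why the host ends up in $m[2] and not $m[3].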
|
The following characters should be escaped with a backslash if you are trying to match them literally:
\ ^ . $ | ( ) [ ] * + ? { } ,
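As a small illustration of why escaping matters, the sketch below compares an unescaped dot with an escaped one. The sample strings are made up for the example.

```php
<?php
// Unescaped: "." is a metacharacter and matches any character,
// so "3x14" is accepted as well.
$loose = preg_match('/^3.14$/', '3x14');

// Escaped: "\." matches only a literal full stop.
$strictHit  = preg_match('/^3\.14$/', '3.14');
$strictMiss = preg_match('/^3\.14$/', '3x14');

var_dump($loose, $strictHit, $strictMiss); // int(1), int(1), int(0)
```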
Special Character Definitions
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
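The quantifiers in the table can be checked one by one. This is only a sketch with made-up sample strings; each comment gives the value preg_match() returns.

```php
<?php
$zeroOrMore = preg_match('/^ab*c$/', 'ac');        // 1: "*" accepts zero b's
$oneOrMore  = preg_match('/^ab+c$/', 'ac');        // 0: "+" needs at least one b
$optional   = preg_match('/^colou?r$/', 'color');  // 1: "?" makes the u optional
$exactly    = preg_match('/^\d{4}$/', '2024');     // 1: exactly four digits
$atLeast    = preg_match('/^\d{2,}$/', '7');       // 0: fewer than two digits
$between    = preg_match('/^\d{2,4}$/', '12345');  // 0: more than four digits
```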
More Special Character Stuff
\t tab (HT, TAB)
\n newline (LF, NL)
\r return (CR)
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
\033 octal char (think of a PDP-11)
\x1B hex char
\c[ control char
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\L lowercase till \E (think vi)
\U uppercase till \E (think vi)
\E end case modification (think vi)
\Q quote (disable) pattern metacharacters till \E
Even More Special Characters
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
\G Match only where previous m//g left off (works only with /g)
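A few of these escapes are easier to tell apart when run. The sketch below uses PHP's preg functions with made-up sample strings; it covers \b, \d, and the difference between \Z and \z.

```php
<?php
// \b: match "cat" only as a whole word.
$wholeWord  = preg_match('/\bcat\b/', 'a cat sat');     // 1
$insideWord = preg_match('/\bcat\b/', 'concatenate');   // 0

// \d+: collect every run of digits.
preg_match_all('/\d+/', 'order 66, item 7', $digits);   // $digits[0] = ['66', '7']

// \Z tolerates one trailing newline, \z does not.
$upperZ = preg_match('/\Aabc\Z/', "abc\n");             // 1
$lowerZ = preg_match('/\Aabc\z/', "abc\n");             // 0
```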
|
Regular expressions correspond to the type-3 (regular) grammars of the Chomsky hierarchy;
they can therefore be used to describe the morphology of a
language.
The notations in use vary widely. This document
gathers, on the one hand, the notations specific to various
implementations and, on the other, the standardization
effort.
In particular, the C++11 standard recognizes
several regular-expression grammars:
ECMAScript: ECMA-262; basic: basic POSIX;
extended: extended POSIX; awk: awk-compatible;
grep: grep-compatible; egrep: compatible with
"extended grep".
The following examples are all valid delimited patterns.
/foo bar/ , #^[^0-9]$# , +php+ , %[a-zA-Z0-9_-]%
If the delimiter needs to be matched inside the pattern, it must be escaped with a backslash. If the delimiter appears often inside the pattern, choosing another delimiter is a good way to improve readability.
/http:\/\// , #http://#
The preg_quote() function can be used to escape a string so that it can be used in a pattern. Its optional second parameter specifies the delimiter to be escaped.
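The two points above, choosing a delimiter and escaping with preg_quote(), can be sketched as follows. The input text in $literal is a hypothetical example; preg_quote() and preg_match() are the standard PHP functions.

```php
<?php
// Same pattern, two delimiters: with "/" the slashes must be escaped,
// with "#" they can be written as-is.
$withSlash = preg_match('/http:\/\//', 'http://example.org');
$withHash  = preg_match('#http://#', 'http://example.org');

// preg_quote() escapes regex metacharacters in arbitrary text; the second
// argument additionally escapes the chosen delimiter ("/" here).
$literal   = 'a/b (c+d)?';
$pattern   = '/' . preg_quote($literal, '/') . '/';
$quotedHit = preg_match($pattern, 'x = a/b (c+d)? ;');

var_dump($withSlash, $withHash, $quotedHit); // int(1), int(1), int(1)
```

Without preg_quote(), the parentheses, plus sign, and question mark in $literal would be read as regex syntax and the slash would end the pattern early.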
In addition to the delimiters mentioned above, it is also possible to use bracket-style delimiters, where the opening and closing brackets are the starting and ending delimiter, respectively.
{this is a pattern}
You can also add pattern modifiers after the ending delimiter. The following examples include case-insensitive matching.
#[a-z]#ix , /(.*)/is , ~([a-z0-9]{4,8})~im , ...
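A short sketch of both ideas, brace delimiters and trailing modifiers, using made-up subject strings:

```php
<?php
// Braces as delimiters: "{" opens the pattern and "}" closes it.
$braces = preg_match('{^this is a pattern$}', 'this is a pattern');

// Modifiers follow the closing delimiter. Here "i" makes [a-z] match
// upper-case letters too, so the upper-case identifier is captured.
$caseless = preg_match('~([a-z0-9]{4,8})~im', 'ID: ABC123', $m);

var_dump($braces, $caseless, $m[1]); // int(1), int(1), string(6) "ABC123"
```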
-
Pattern Modifiers
The currently available PCRE modifiers are listed below. The
names in parentheses refer to internal PCRE names for
these modifiers. Spaces and newlines are ignored in
modifiers; any other character causes an error.
-
i (PCRE_CASELESS)
-
If this modifier is set,
letters in the pattern match both upper and lower case
letters.
-
m (PCRE_MULTILINE)
-
By default, PCRE treats
the subject string as consisting of a single "line" of
characters (even if it actually contains several
newlines). The "start of line" metacharacter (^) matches
only at the start of the string, while the "end of line"
metacharacter ($) matches only at the end of the string,
or before a terminating newline (unless D modifier is set). This is the
same as Perl. When this
modifier is set, the "start of line" and "end of line"
constructs match immediately following or immediately
before any newline in the subject string, respectively,
as well as at the very start and end. This is equivalent
to Perl's /m modifier. If there are no "\n" characters
in a subject string, or no occurrences of ^ or $ in a
pattern, setting this modifier has no effect.
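The effect of the m modifier can be seen by counting matches of ^ on a multi-line subject. This is a sketch with a made-up string:

```php
<?php
$text = "first line\nsecond line\nthird line";

// Without m, ^ matches only at the very start of the subject.
$single = preg_match_all('/^\w+/', $text, $a);   // 1 match: "first"

// With m, ^ also matches immediately after each newline.
$multi = preg_match_all('/^\w+/m', $text, $b);   // 3 matches

print_r($b[0]); // first, second, third
```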
-
s (PCRE_DOTALL)
-
If this modifier is set,
a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are
excluded. This modifier is equivalent to Perl's /s
modifier. A negative class such as [^a] always matches a
newline character, independent of the setting of this
modifier.
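A minimal sketch of the s modifier, with a made-up subject that contains a newline between two tags. The last call illustrates the remark about negated classes.

```php
<?php
$text = "<b>bold\ntext</b>";

// Without s, the dot stops at the newline, so the closing tag is never reached.
$noDotall = preg_match('/<b>(.*)<\/b>/', $text);           // 0

// With s, the dot also matches the newline.
$dotall = preg_match('/<b>(.*)<\/b>/s', $text, $m);        // 1

// A negated class matches newlines whether or not s is set.
$negated = preg_match('/<b>([^<]*)<\/b>/', $text, $n);     // 1
```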
-
x (PCRE_EXTENDED)
-
If this modifier is set,
whitespace data characters in the pattern are totally
ignored except when escaped or inside a character class,
and characters between an unescaped # outside a
character class and the next newline character,
inclusive, are also ignored. This is equivalent to
Perl's /x modifier, and makes it possible to include
commentary inside complicated patterns. Note, however,
that this applies only to data characters. Whitespace
characters may never appear within special character
sequences in a pattern, for example within the sequence
(?( which introduces a conditional subpattern.
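With the x modifier a pattern can be spread over several lines and commented. The date format below is only an illustration:

```php
<?php
// Whitespace and "# ..." comments inside the pattern are ignored under x.
$pattern = '/
    ^ (\d{4})   # year
    -  (\d{2})  # month
    -  (\d{2})  # day
    $
/x';

preg_match($pattern, '2024-03-15', $m);
print_r($m); // 2024-03-15, 2024, 03, 15
```

The comments must not contain the delimiter character ("/" here), since it would end the pattern.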
-
e (PREG_REPLACE_EVAL)
-
-
A (PCRE_ANCHORED)
-
If this modifier is set,
the pattern is forced to be "anchored", that is, it is
constrained to match only at the start of the string
which is being searched (the "subject string"). This
effect can also be achieved by appropriate constructs in
the pattern itself, which is the only way to do it in
Perl.
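A small sketch of the A modifier, with made-up subjects, next to the in-pattern alternative (\A) that the text mentions:

```php
<?php
$unanchored  = preg_match('/\d+/', 'abc 123');    // 1: digits found later in the string
$anchored    = preg_match('/\d+/A', 'abc 123');   // 0: must match at offset 0
$anchoredHit = preg_match('/\d+/A', '123 abc');   // 1

// The same constraint written inside the pattern.
$equivalent  = preg_match('/\A\d+/', 'abc 123');  // 0
```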
-
D (PCRE_DOLLAR_ENDONLY)
-
If this modifier is set,
a dollar metacharacter in the pattern matches only at
the end of the subject string. Without this modifier, a
dollar also matches immediately before the final
character if it is a newline (but not before any other
newlines). This modifier is ignored if m modifier is set. There is no
equivalent to this modifier in Perl.
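The D modifier matters mostly when validating input that may carry a trailing newline. A minimal sketch with made-up strings:

```php
<?php
$withNewline = "abc\n";

// By default, $ also matches just before a final newline.
$default = preg_match('/^abc$/', $withNewline);     // 1

// With D, $ matches only at the very end of the subject.
$dollarEnd = preg_match('/^abc$/D', $withNewline);  // 0
$clean     = preg_match('/^abc$/D', 'abc');         // 1
```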
-
S
-
When a pattern is going
to be used several times, it is worth spending more time
analyzing it in order to speed up the time taken for
matching. If this modifier is set, then this extra
analysis is performed. At present, studying a pattern is
useful only for non-anchored patterns that do not have a
single fixed starting character.
-
U (PCRE_UNGREEDY)
-
This modifier inverts the
"greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by ?.
It is not compatible with Perl. It can also be set by a
(?U) modifier setting within the pattern
or by a question mark behind a quantifier (e.g. .*?).
Note: It is
usually not possible to match more than pcre.backtrack_limit characters in
ungreedy mode.
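The inversion described above can be sketched with a made-up HTML fragment: U swaps the meaning of a plain quantifier and a quantifier followed by a question mark.

```php
<?php
$html  = '<i>one</i> <i>two</i>';
$whole = '<i>one</i> <i>two</i>';
$first = '<i>one</i>';

preg_match('/<i>.*<\/i>/',     $html, $greedy);      // greedy by default
preg_match('/<i>.*<\/i>/U',    $html, $lazyU);       // U makes .* lazy
preg_match('/<i>.*?<\/i>/',    $html, $lazyQ);       // .*? is lazy without U
preg_match('/<i>.*?<\/i>/U',   $html, $greedyAgain); // under U, .*? is greedy
preg_match('/(?U)<i>.*<\/i>/', $html, $inline);      // inline form of U
```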
-
X (PCRE_EXTRA)
-
This modifier turns on
additional functionality of PCRE that is incompatible
with Perl. Any backslash in a pattern that is followed
by a letter that has no special meaning causes an error,
thus reserving these combinations for future expansion.
By default, as in Perl, a backslash followed by a letter
with no special meaning is treated as a literal. There
are at present no other features controlled by this
modifier.
-
J (PCRE_INFO_JCHANGED)
-
The (?J) internal option
setting changes the local PCRE_DUPNAMES
option, which allows duplicate names for subpatterns.
-
u (PCRE_UTF8)
-
This modifier turns on
additional functionality of PCRE that is incompatible
with Perl. Pattern strings are treated as UTF-8. This
modifier is available from PHP 4.1.0 or greater on Unix
and from PHP 4.2.3 on win32. UTF-8 validity of the
pattern is checked since PHP 4.3.5.
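A sketch of what the u modifier changes for a multi-byte character. The subject is written with byte escapes so that the example does not depend on the encoding of the source file.

```php
<?php
$e = "\xC3\xA9"; // "é" encoded as two UTF-8 bytes

// Without u, the subject is treated as bytes: "." consumes one byte only,
// so ^.$ cannot cover the two-byte character.
$bytes = preg_match('/^.$/', $e);    // 0

// With u, pattern and subject are treated as UTF-8 and "." matches
// the whole character.
$chars = preg_match('/^.$/u', $e);   // 1
```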
|