EReg issue

EReg issue

Juan Delgado
Hi there,

Sorry, because this is most likely me being thick while using regular
expressions. I'm trying to extract some tables out of an HTML page (I
know, I know, regexps and HTML!), but when I try something like this:

var tables = ~/<table class\=\"wadus\".*<\/table>/s;

It matches the first <table> tag with the *last* </table> tag in the
page. How can I match the first <table> tag with its own </table> tag
(which is obviously the first </table> occurrence after <table>)? So:

<table class="wadus">
        ....
        // these bits i'm interested in
</table>
...
// but i get up until here!
</table>


Also, how can I iterate over all the matches? Let's say a regexp
matches 10 times in a string, it seems there's no way to get an array
/ list / iterator of the bits that match your regexp? Do I have to
create a while loop, use matched(index) and catch the exception when
there are no more matches to break the loop? Tried that, but I think
it's giving me weird results.

Any ideas?

Cheers!

Juan

--
Juan Delgado - Zárate
http://zarate.tv
http://blog.zarate.tv

--
haXe - an open source web programming language
http://haxe.org

Re: EReg issue

Simon Krajewski
Am 17.06.2011 10:54, schrieb Juan Delgado:

> Hi there,
>
> Sorry, because this is most likely me being thick while using regular
> expressions. I'm trying to extract some tables out of a HTML page (i
> know, i know, regexps and HTML!), but when I try something like this:
>
> var tables = ~/<table class\=\"wadus\".*<\/table>/s;
>
> It matches the first <table> tag with the *last* </table> tag in the
> page. How can I match the first <table> tag with its own </table> tag
> (which is obviously the first </table> occurrence after <table>)? So:
>
> <table class="wadus">
> ....
> // these bits i'm interested in
> </table>
> ...
> // but i get up until here!
> </table>

The keyword for your problem is "greedy": The standard * is greedy in
that it matches as much as it can. You usually use *? to indicate
non-greedy matching, but IIRC there were some issues with that in at
least one of the target platforms.
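
To make the difference concrete, here is a minimal sketch comparing the
two quantifiers on a tiny input. The class and function names are made up
for illustration, and it assumes a target whose regex engine supports both
*? and the s flag (neko/PCRE should, others may not):

```haxe
class GreedyVsLazy {
  // Returns the whole text matched by `re` in `s`, or null if there is no match.
  public static function firstMatch(re:EReg, s:String):Null<String> {
    return re.match(s) ? re.matched(0) : null;
  }

  static function main() {
    var html = '<table class="wadus">A</table><p>x</p><table>B</table>';
    // Greedy: .* runs on to the last </table> in the input.
    trace(firstMatch(~/<table class="wadus">.*<\/table>/s, html));
    // Lazy: .*? stops at the first </table> after the opening tag.
    trace(firstMatch(~/<table class="wadus">.*?<\/table>/s, html));
  }
}
```

The first trace should show the whole input, the second only the first
table.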

> Also, how can I iterate over all the matches? Let's say a regexp
> matches 10 times in a string, it seems there's no way to get an array
> / list / iterator of the bits that match your regexp? Do I have to
> create a while loop, use matched(index) and catch the exception when
> there are no more matches to break the loop? Tried that, but I think
> it's giving me weird results.

I never tried this in haXe before, but the EReg API has matchedRight,
which "Returns the part of the string that was at the right of the
matched substring." [1] You should be able to loop over a sequence of
match, matched and matchedRight until matchedRight is null or an empty
string (not sure which one).

Regards
Simon

[1] http://haxe.org/api/ereg


Re: EReg issue

Andreas Mokros
In reply to this post by Juan Delgado
Hi.

On Fri, 17 Jun 2011 09:54:06 +0100
Juan Delgado <[hidden email]> wrote:
> var tables = ~/<table class\=\"wadus\".*<\/table>/s;

What is /s for?

> How can I match the first <table> tag with its own </table> tag
> (which is obviously the first </table> occurrence after <table>)?

I'd do something like:
var rgx = ~/<table class="wadus">([^<]*)<\/table>/;
That matches all characters that are not "<".

> Also, how can I iterate over all the matches?

You could (mis)use customReplace for that:
var str = '
<table class="wadus">
This is a table content
</table>
<table class="wadus">
This is another table content
</table>
';
var list = new List();
var rgx = ~/<table class="wadus">([^<]*)<\/table>/;
rgx.customReplace(str, function(r){
  list.add(r.matched(1));
  return null;
});
trace(list);

--
Mockey

Re: EReg issue

Andreas Mokros
In reply to this post by Simon Krajewski
Hi.

On Fri, 17 Jun 2011 12:08:36 +0200
Simon Krajewski <[hidden email]> wrote:
> the EReg API has matchedRight
> which "Returns the part of the string that was at the right of the
> matched substring."

Ah, right. Forgot about matchedRight.
So this might be a better way:

var rgx = ~/<table class="wadus">([^<]*)<\/table>/;
while (rgx.match(str)) {
  trace(rgx.matched(1));
  str = rgx.matchedRight();
}
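
If you need the matches as a collection rather than traced one by one,
the same loop can be wrapped in a small helper. This is only a sketch:
collectMatches is a made-up name, not part of the EReg API, and the guard
against empty matches is my own addition to avoid an endless loop:

```haxe
class MatchLoop {
  // Hypothetical helper (not part of EReg): collects capture group `group`
  // from every match of `re` in `s`, scanning left to right.
  public static function collectMatches(re:EReg, s:String, group:Int = 1):Array<String> {
    var found:Array<String> = [];
    var rest = s;
    while (re.match(rest)) {
      found.push(re.matched(group));
      // An empty match gives no guarantee of progress, so stop
      // rather than risk looping forever.
      if (re.matchedPos().len == 0) break;
      // Continue searching in whatever follows the current match.
      rest = re.matchedRight();
    }
    return found;
  }

  static function main() {
    var str = '<table class="wadus">first</table> <table class="wadus">second</table>';
    var rgx = ~/<table class="wadus">([^<]*)<\/table>/;
    trace(collectMatches(rgx, str).join(", "));
  }
}
```

With the sample string above this should collect "first" and "second".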

--
Mockey

Re: EReg issue

clemos
In reply to this post by Andreas Mokros
Hi,

> var rgx = ~/<table class="wadus">([^<]*)<\/table>/;
> That matches all characters that are not "<".

The thing is, you probably want to get the <tr>, <td>, ...
If it's about parsing HTML, maybe you'd be better off using haXe Xml?
If that's not possible and you still want the inner tags (<tr>, <td>, ...),
then you'll probably need two operations, something like this (untested):

var parts = new List();
var tables = new List();

~/<table class="wadus">(.*)$/g.customReplace( str , function(r){
   parts.push( r.matched(0) );
   return null;
});
for( p in parts ){
  var rgx = ~/^(.*)<\/table>/;
  if( rgx.match( p ) ){
    tables.push( rgx.matched(0) );
  }
}

Cheers,
Clément

Re: EReg issue

Andreas Mokros
Hi.

On Fri, 17 Jun 2011 14:35:54 +0200
clemos <[hidden email]> wrote:
> The thing is, you probably want to get the <tr>, <td>, ...

Hmm, yeah, completely forgot about those :-)
As Simon already pointed out, there is *? for non-greedy matching. At
least in neko (which uses PCRE) this works fine:

var rgx = ~/<table class="wadus">(.*?)<\/table>/s;
while (rgx.match(str)) {
  trace(rgx.matched(1));
  str = rgx.matchedRight();
}

The /s flag makes "." match newlines as well (found that out in the meantime).
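
A quick way to see the effect of the flag, as a rough sketch (the class
and function names are invented, and whether s is honoured depends on the
target; I only know that neko/PCRE handles it):

```haxe
class DotAllDemo {
  // Returns capture group 1 of the first match of `re` in `s`, or null.
  public static function inner(re:EReg, s:String):Null<String> {
    return re.match(s) ? re.matched(1) : null;
  }

  static function main() {
    var text = "<td>one\ntwo</td>";
    // Without s, "." stops at the newline, so no match is expected.
    trace(inner(~/<td>(.*?)<\/td>/, text));
    // With s, "." also matches the newline, so the cell content is captured.
    trace(inner(~/<td>(.*?)<\/td>/s, text));
  }
}
```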

--
Mockey

Re: EReg issue

Juan Delgado
Thanks guys, the non-greedy bit did the trick!

About how to get all the matches, am I the only one missing an
iterator? The while loop trick works, but it's not the most obvious
thing.

Do you guys think it's worth raising a feature request? Could there be
cross-platform issues?

J

--
Juan Delgado - Zárate
http://zarate.tv
http://blog.zarate.tv
