Topic: Regular expression

I am parsing files, but I only want to parse them if they are in a certain format. This is what they are supposed to look like:

X4640_oraldap1-sp_2.1.2.11.0.0_led_today.out

It starts with a letter, any letter, then can be any amount of any numbers 0-9. Then an underscore. Then any amount of any characters a-z until a dash. Then any amount of characters a-z until an underscore. Then any amount of any characters until the next underscore. Then any amount of characters a-z until another underscore, and so on until it's finished.

This is what I have so far; the 'i' after the pattern makes it case insensitive:

/\A[a-z][0-9]*[_].*[-][a-z]*[_].*[_]/i

It's not finished because I'm stuck on the underscore right after the long number separated by periods. It matches perfectly up to that point, but instead of stopping at the underscore before 'led', it continues on to the next underscore and stops there.

X4640_oraldap1-sp_2.1.2.11.0.0_led_today.out

/\A[a-z][0-9]*[_].*[-][a-z]*[_].*[_]/i

The last part of the pattern, '.*[_]', should match every character after '-sp_' up to the next underscore (the one before 'led'), but instead it captures everything up to the underscore after 'led'. I'm not sure why it won't stop at the first underscore in its path; it stops at the second one. Why is this? I want it to stop at the first underscore. How can I change this?

Sorry, that's probably confusing. Maybe the best way to see what I mean is to do this:

Go to this regexp editor and copy and paste this regexp:

/\A[a-z][0-9]*[_].*[-][a-z]*[_].*[_]/i

and copy and paste the string to match:

X4640_oraldap1-sp_2.1.2.11.0.0_led_today.out

I hope you can understand that. Thanks in advance.

Last edited by RailsRhino (2010-07-20 16:18:20)

- Ben

Re: Regular expression

It does not stop at the first underscore because '.*' is greedy: it consumes as many characters as it can while still letting the rest of the pattern match, so it runs to the last underscore it can reach. To stop at the first underscore, use '[^_]' instead of '.'.

Re: Regular expression

Hmm, that didn't work. I think you might have thought I was asking something else. Here is the whole name:

X4540_oraldap1-sp_1.0.1.11.0.0_led_today.out

Here is the first part of the regular expression that I have working correctly:
/\A[a-z][0-9]*[_].*[-][a-z]*[_]/i

the above matches this:
X4540_oraldap1-sp_

The next piece of the regular expression I had was this:
/\A[a-z][0-9]*[_].*[-][a-z]*[_].*[_]/i

that matched this:
X4540_oraldap1-sp_1.0.1.11.0.0_led_
instead of stopping at the underscore before 'led'.

However, I found a solution anyway: I put [0-9] after the '.*', and it stopped at the last number before the underscore before 'led'.

/\A[a-z][0-9]*[_].*[-][a-z]*[_].*[0-9][_]/i

This matches this:
X4540_oraldap1-sp_1.0.1.11.0.0_

Here is the whole expression now:
/\A[a-z][0-9]*[_].*[-][a-z]*[_].*[0-9][_][a-z]*[_][a-z]*\.[a-z]*/i
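
One caveat with the '.*[0-9][_]' approach, sketched below: it works for this file name only because no later segment ends in a digit. The second file name (with 'led2') is a made-up example, not one from the real data, and the constant names are only for illustration.

```ruby
# Pattern from this post: '.*[0-9][_]' stops at the LAST "digit, underscore" pair.
DIGIT_STOP = /\A[a-z][0-9]*[_].*[-][a-z]*[_].*[0-9][_]/i

REAL_NAME = 'X4540_oraldap1-sp_1.0.1.11.0.0_led_today.out'
# Hypothetical name: the segment after the version number ends in a digit.
MADE_UP_NAME = 'X4540_oraldap1-sp_1.0.1.11.0.0_led2_today.out'

puts REAL_NAME[DIGIT_STOP]     # X4540_oraldap1-sp_1.0.1.11.0.0_
puts MADE_UP_NAME[DIGIT_STOP]  # X4540_oraldap1-sp_1.0.1.11.0.0_led2_
```

So the extra [0-9] narrows where the greedy '.*' can stop, but it does not prevent it from running past an underscore.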

Thanks for the help anyways!

Last edited by RailsRhino (2010-07-21 10:08:35)

- Ben

Re: Regular expression

Maybe I failed to explain what I meant.

Yours: /\A[a-z][0-9]*[_].*[-][a-z]*[_].*[_]/i
Use: /\A[a-z][0-9]*[_].*[-][a-z]*[_][^_]*[_]/i instead

Testing it in irb:

irb(main):001:0> def showmatch(s, r)
irb(main):002:1> s =~ r
irb(main):003:1> "#{$`} << #{$~} >> #{$'}"
irb(main):004:1> end
=> nil
irb(main):005:0> s = 'X4540_oraldap1-sp_1.0.1.11.0.0_led_today.out'
=> "X4540_oraldap1-sp_1.0.1.11.0.0_led_today.out"
irb(main):006:0> r = /\A[a-z][0-9]*[_].*[-][a-z]*[_][^_]*[_]/i
=> /\A[a-z][0-9]*[_].*[-][a-z]*[_][^_]*[_]/i
irb(main):007:0> showmatch s, r
=> " << X4540_oraldap1-sp_1.0.1.11.0.0_ >> led_today.out"

Explanation:
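Logically, your regexp can match both 'X4540_oraldap1-sp_1.0.1.11.0.0_led_' and 'X4540_oraldap1-sp_1.0.1.11.0.0_'. Because '.*' is greedy, Ruby tries the longest run of characters first and only backs off until the following '[_]' matches, so you get the longer one. The notation [^_]* means 'any character except _, repeated 0 or more times', so it cannot run past the first underscore.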
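
To make the explanation concrete, here is a small sketch that can be run as a plain Ruby script. The constant names are only for illustration, and the lazy variant '.*?' is an alternative that was not suggested above.

```ruby
SAMPLE = 'X4540_oraldap1-sp_1.0.1.11.0.0_led_today.out'

# Greedy '.*' grabs as much as it can, then backs off only until '_' fits.
GREEDY  = /\A[a-z][0-9]*[_].*[-][a-z]*[_].*[_]/i
# Lazy '.*?' grows only as far as needed to let '_' match.
LAZY    = /\A[a-z][0-9]*[_].*[-][a-z]*[_].*?[_]/i
# '[^_]*' cannot cross an underscore at all.
NEGATED = /\A[a-z][0-9]*[_].*[-][a-z]*[_][^_]*[_]/i

puts SAMPLE[GREEDY]   # X4540_oraldap1-sp_1.0.1.11.0.0_led_
puts SAMPLE[LAZY]     # X4540_oraldap1-sp_1.0.1.11.0.0_
puts SAMPLE[NEGATED]  # X4540_oraldap1-sp_1.0.1.11.0.0_
```

The lazy and negated forms agree here, but '[^_]*' states the intent ("no underscores in this segment") directly, whereas '.*?' could still cross an underscore if a later part of a longer pattern forced it to.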

Last edited by bluesman.alex (2010-07-21 11:03:39)

Re: Regular expression

Ohh I see, you're right, that's the better way of doing it.

I was confused because everywhere I look it says '^' marks the start or beginning of a line. However, now I see somewhere that it says [^abc] = any character but abc.

Thanks so much!

Last edited by RailsRhino (2010-07-21 11:43:57)

- Ben

Re: Regular expression

to make this even nicer... how can I say (referring to '_1.0.1.11.0.0_') any character but '_' as long as it's a period or number. In other words, I want to make sure the section above is only numbers and periods until the coming underscore. So any character but an underscore AND those characters must be numbers or periods.

Something like:

[^_(\.|[0-9])]*

?

I'm sure that syntax is incorrect though...

Last edited by RailsRhino (2010-07-21 12:04:05)

- Ben

Re: Regular expression

RailsRhino wrote:

Ohh I see, you're right, that's the better way of doing it.

I was confused because everywhere I look it says '^' marks the start or beginning of a line. However, now I see somewhere that it says [^abc] = any character but abc.

Thanks so much!

FWIW, I always found this annoying, i.e. that ^ means both 'not' and 'start of line'. Although maybe on some level it is the same thing and I just never understood why :)
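
For what it's worth, the two meanings are unrelated and depend only on position: as the first character inside square brackets '^' negates the class, and outside brackets it anchors to the start of a line. A minimal sketch, with made-up sample strings:

```ruby
# Outside brackets, '^' anchors the match to the start of a line.
puts(/^b/.match?('abc'))     # false: 'b' is not at the start of a line
puts(/^b/.match?("a\nbc"))   # true: 'b' starts the second line

# As the first character inside brackets, '^' negates the class.
puts 'abc'[/[^a]+/]          # bc

# Anywhere else inside brackets, '^' is just a literal caret.
puts 'a^b'[/[a^]+/]          # a^
```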

###########################################
#If i've helped you then please recommend me at Working With Rails:
#http://www.workingwithrails.com/person/ … i-williams

Re: Regular expression

RailsRhino wrote:

to make this even nicer... how can I say (referring to '_1.0.1.11.0.0_') any character but '_' as long as it's a period or number. In other words, I want to make sure the section above is only numbers and periods until the coming underscore. So any character but an underscore AND those characters must be numbers or periods.

Something like:

[^_(\.|[0-9])]*

?

I'm sure that syntax is incorrect though...

logically speaking, isn't

"any character but '_' as long as it's a period or number"

the same as saying

"a period or number"

?


Re: Regular expression

Haha, I was all caught up in using the '[^_]' that I didn't even realize that.

- Ben

Re: Regular expression

RailsRhino wrote:

Haha, I was all caught up in using the '[^_]' that I didn't even realize that.

:)

btw this is a nicer (or at least more conventional) way of saying 'any amount of periods or numbers':

[\.\d]*


Re: Regular expression

Alright thanks a lot!

- Ben

Re: Regular expression

Max Williams wrote:
RailsRhino wrote:

Haha, I was all caught up in using the '[^_]' that I didn't even realize that.

:)

btw this is a nicer (or at least more conventional) way of saying 'any amount of periods or numbers':

[\.\d]*

oops that "*" should have been a "+" since we want to match at least one period/number.
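
The practical difference, sketched with a made-up string whose version segment is empty: '*' accepts it, '+' rejects it. The constant names are only for illustration. Note also that the period needs no backslash inside a character class, so '[.\d]' and '[\.\d]' are equivalent.

```ruby
STAR = /\A_[.\d]*_\z/   # zero or more periods/digits between the underscores
PLUS = /\A_[.\d]+_\z/   # at least one period/digit between the underscores

puts STAR.match?('_1.0.1.11.0.0_')  # true
puts PLUS.match?('_1.0.1.11.0.0_')  # true
puts STAR.match?('__')              # true: '*' is satisfied by nothing at all
puts PLUS.match?('__')              # false
```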


Re: Regular expression

Ohh yeah I also didn't realize '*' could be zero times, I need to change all my '*' to '+'.

Thanks again

- Ben

Re: Regular expression

Rhino

/^[a-z]\d+_\w+-[a-z]+_[.\d]+(?:_[a-z\d]+)+\.\w+$/i

I don't like using . in a regex, much less .*. As wildcards go, it's just a little too wild, and it can be too greedy.
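
A sketch of how that pattern might be used to filter file names before parsing. One Ruby-specific caveat: '^' and '$' match at line boundaries, so a string containing a newline can slip through, while '\A' and '\z' anchor to the whole string. The parseable? helper, the constant names, and the sample list are assumptions for illustration.

```ruby
# The pattern exactly as posted, anchored with ^ and $.
LINE_ANCHORED = /^[a-z]\d+_\w+-[a-z]+_[.\d]+(?:_[a-z\d]+)+\.\w+$/i
# Same pattern with \A and \z swapped in (a substitution, not part of the post).
FILE_NAME = /\A[a-z]\d+_\w+-[a-z]+_[.\d]+(?:_[a-z\d]+)+\.\w+\z/i

# Hypothetical helper: true when the name has the expected shape.
def parseable?(name)
  FILE_NAME.match?(name)
end

names = [
  'X4640_oraldap1-sp_2.1.2.11.0.0_led_today.out',
  'X4540_oraldap1-sp_1.0.1.11.0.0_led_today.out',
  'notes.txt',
  "X4640_oraldap1-sp_2.1.2.11.0.0_led_today.out\nextra line"
]

names.each { |n| puts "#{n.inspect}: #{parseable?(n)}" }

# The line-anchored version accepts the name with a trailing second line.
puts LINE_ANCHORED.match?(names.last)  # true
```

Whether the stricter anchors matter depends on where the names come from; for names read from a directory listing, either form should behave the same.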

Re: Regular expression

specious wrote:

Rhino

/^[a-z]\d+_\w+-[a-z]+_[.\d]+(?:_[a-z\d]+)+\.\w+$/i

I don't like using . in a regex, much less .*. As wild cards go, it's just a little too wild... it can be too greedy

Thanks for the refined code!

- Ben