Java Question (Should be easy)


12 replies to this topic

#1 thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 03:26

Could someone look at my code and tell me why findWithinHorizon is giving me more than what my regular expression is looking for? I cannot figure it out. I think that if I take the file extension part out it will work, but when it's there it takes in the whole line, with extra HTML code. Enter 2776 when the program starts. The program will download an HTML page and search it for my regular expression, but it isn't working! This code is basically another program I made that DID work, but when I changed it to work for C&H it stopped :/

Source code-
http://pastebin.com/cxTv30yW

THANKS!


#2 CentralDogma

    Neowinian Senior

  • 2,142 posts
  • Joined: 29-February 08

Posted 25 April 2012 - 14:48

I don’t have the ability to compile your code right now, so I can’t give you a straight answer. But, I can help you to debug your code yourself!

What’s being stored in theImage after findWithinHorizon is run? What was the expected value?

#3 ZakO

    Resident Fanatic

  • 827 posts
  • Joined: 21-September 07

Posted 25 April 2012 - 15:07

I haven't tested it, but I suspect it's because you're using a greedy match in your regex. Change it to be non-greedy by adding ? after .*

theImage = sc2.findWithinHorizon("[h][t][t][p][:][/][/]www.explosm.net/db/files/Comics/.*?[.][jJpPgG][pPnNiI][eE]?[gGfF]", 0);

I've never used Java so I'm unfamiliar with its regex matcher but it looks like (?i) can make a part of a pattern case insensitive, so you could do something a bit easier to read for the extension too:

theImage = sc2.findWithinHorizon("www\\.explosm\\.net/db/files/Comics/.*?\\.(?i)(jpe?g|gif|png)", 0);


#4 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 16:33

EDIT****
.*?
I think this worked. I'll post back in a bit

I'll try that!

This is what's being stored in theImage
http://pastebin.com/PirKMMXU
This is what I want
http://www.explosm.n...s/Kris/well.png

Why would it grab so much extra? Is it because that regex appears in that line again?

#5 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 16:39

This worked for part of it
.*?

Why did this fix it?

But then if I enter 2717 into the program, theImage doesn't grab anything :/

#6 ZakO

    Resident Fanatic

  • 827 posts
  • Joined: 21-September 07

Posted 25 April 2012 - 16:50

By default quantifiers in regex are greedy, adding ? makes them lazy.

For example: if you had the string abcdabc, the pattern .*b would match up to the last b character (abcdab), but .*?b would match just ab.

In your case it was matching the beginning of the URL, and then trying to find the last match for jpg, jpeg, gif or png. Adding the ? makes it stop on the first match.

There's a better explanation here: http://www.regular-e...nfo/repeat.html
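
Here is a minimal Java sketch of the abcdabc example, in case it helps to run it. The class name GreedyVsLazy and the helper firstMatch are made up for illustration and are not part of the original program:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns the first substring of input
    // matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "abcdabc";
        // Greedy: .* grabs as much as it can, then backs off to the last b.
        System.out.println(firstMatch(".*b", input));   // abcdab
        // Lazy: .*? grabs as little as it can, so it stops at the first b.
        System.out.println(firstMatch(".*?b", input));  // ab
    }
}
```

The same thing happens with the image URL: when a line of HTML contains more than one image extension, the greedy version runs on to the last one.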

#7 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 16:56

Oh that makes sense!

Ok so at page 2717 the link does NOT contain www. So I changed my regular expression.
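
The changed expression isn't shown here, so this is only a sketch of one way to make the www. part optional, using an optional group. The class name, the findImage helper, and the two sample HTML lines (including the file names one.png and two.JPG) are invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalWww {
    // (?:www\.)? makes the "www." prefix optional.
    // (?i:...) makes only the file extension case-insensitive.
    static final Pattern COMIC_IMAGE = Pattern.compile(
        "http://(?:www\\.)?explosm\\.net/db/files/Comics/.*?\\.(?i:jpe?g|gif|png)");

    // Hypothetical helper: returns the first image URL found in html, or null.
    static String findImage(String html) {
        Matcher m = COMIC_IMAGE.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from the real pages.
        String withWww = "<img src=\"http://www.explosm.net/db/files/Comics/Example/one.png\">";
        String withoutWww = "<img src=\"http://explosm.net/db/files/Comics/Example/two.JPG\">";
        System.out.println(findImage(withWww));
        System.out.println(findImage(withoutWww));
    }
}
```

Both sample lines should produce a match, with or without the www. prefix.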

#8 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 19:57

EDIT:
I think this fixed it
finally {
    //System.out.println("hi");
    continue;
}

Ok so I have another question. Sometimes the page numbers don't exist, which causes an error. I try to catch the error, then move along. The catch block catches it but then loops. I added a scanner.next(); to consume the bad input. That stops the catch from looping (or I guess the program from looping with bad input), but then it just stops in the catch and doesn't continue past the scanner.next(). What can I do? I set up the next link in the catch block, so I need the program to leave the catch block and start over at the top of the while loop.
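
For anyone reading along, here is a rough sketch of the general pattern for skipping a missing page and carrying on. It is not the code from the pastebin: fetchPage is a hypothetical stand-in that pretends page 2718 does not exist, and nothing is actually downloaded. The idea is to advance the page number in the loop header, outside the try, so a failure can never leave the loop stuck on the same page:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SkipMissingPages {
    // Hypothetical stand-in for the real download: page 2718 is "missing".
    static String fetchPage(int number) throws IOException {
        if (number == 2718) {
            throw new IOException("page " + number + " not found");
        }
        return "<html>comic " + number + "</html>";
    }

    // Visits pages first..last and returns the numbers that loaded.
    static List<Integer> crawl(int first, int last) {
        List<Integer> loaded = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            try {
                fetchPage(page);
                loaded.add(page);
            } catch (IOException e) {
                // Report the failure and fall through. The page++ in the
                // loop header still runs, so the bad page is not retried.
                System.out.println("Skipping " + page + ": " + e.getMessage());
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        System.out.println(crawl(2717, 2719)); // [2717, 2719]
    }
}
```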

http://pastebin.com/Kpajd6ER

#9 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 20:43

Ok finally got it working perfectly!
Source-
http://pastebin.com/SXpi6NLM
Jar-
http://www.mediafire...gsfp43oj1k88n31

#10 ZakO

    Resident Fanatic

  • 827 posts
  • Joined: 21-September 07

Posted 25 April 2012 - 21:27

I have to ask... where did you learn to put square brackets around almost every character in your regex pattern? You don't need to do that; it just makes it harder to read. Also, it doesn't matter much in your pattern, but you should be escaping the . in "explosm.net": the full stop has a special meaning (match any character).

Good job overall though!

#11 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 25 April 2012 - 21:34

Are you referring to the http://www part? I did the www because not all the links have that. And I just did that to the http when I was trying to debug. I just left it like that. If you are referring to my file extensions, I think it's needed.

#12 ZakO

    Resident Fanatic

  • 827 posts
  • Joined: 21-September 07

Posted 25 April 2012 - 21:37

I meant the http part. Square brackets only have meaning if you use them like [ab] which matches a or b, but you have it around single characters: [h][t][t][p][:][/][/]. It's not a problem, I was just interested because I've never seen anyone do that before.
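
A quick Java sketch to check both points, the brackets and the unescaped dot. The class name and the small matches wrapper are just for illustration, and explosmXnet is a made-up test string:

```java
import java.util.regex.Pattern;

class DotAndBrackets {
    // Thin wrapper around Pattern.matches, which tests the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // An unescaped dot matches any single character.
        System.out.println(matches("explosm.net", "explosmXnet"));   // true
        // Escaped, only a literal dot is accepted.
        System.out.println(matches("explosm\\.net", "explosmXnet")); // false
        // Brackets around single characters change nothing.
        System.out.println(matches("[h][t][t][p]", "http"));         // true
        // Inside brackets the dot is literal, so [.] behaves like \\.
        System.out.println(matches("explosm[.]net", "explosmXnet")); // false
    }
}
```

So the one place where single-character brackets do something is [.], which is another way of writing an escaped dot.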

#13 OP thatguyandrew1992

    Neowinian Senior

  • 2,079 posts
  • Joined: 22-January 09

Posted 26 April 2012 - 04:47

Just a debugging thing but left it haha