• 0

Java Question (Should be easy)


Question

Could someone look at my code and tell me why findInHorizon is giving me more than what my regular expression is looking for? I cannot figure it out. I think when I take the file extension part out, it will work, but when it's there takes in that whole line with extra html code. Enter 2776 when the program starts. The program will download an HTML page and look through for my regular expression, but it isn't working! This code is basically another program I made that DID work, but now when I changed it to work for C&H it stopped :/

Source code-

http://pastebin.com/cxTv30yW

THANKS!

Link to comment
https://www.neowin.net/forum/topic/1072709-java-question-should-be-easy/
Share on other sites

12 answers to this question

Recommended Posts

  • 0

I haven't tested it but I suspect it'll be because you're using a greedy match in your regex, change it to be non-greedy using ? after .*

theImage = sc2.findWithinHorizon("[h][t][t][p][:][/][/]www.explosm.net/db/files/Comics/.*?[.][jJpPgG][pPnNiI][eE]?[gGfF]", 0);[/CODE]

I've never used Java so I'm unfamiliar with its regex matcher but it looks like (?i) can make a part of a pattern case insensitive, so you could do something a bit easier to read for the extension too:

[CODE]theImage = sc2.findWithinHorizon("www\\.explosm\\.net/db/files/Comics/.*?\\.(?i)(jpe?g|gif|png)", 0);[/CODE]

  • 0

EDIT****

.*?

I think this worked. I'll post back in a bit

I haven't tested it but I suspect it'll be because you're using a greedy match in your regex, change it to be non-greedy using ? after .*

theImage = sc2.findWithinHorizon("[h][t][t][p][:][/][/]www.explosm.net/db/files/Comics/.*?[.][jJpPgG][pPnNiI][eE]?[gGfF]", 0);[/CODE]

I've never used Java so I'm unfamiliar with its regex matcher but it looks like (?i) can make a part of a pattern case insensitive, so you could do something a bit easier to read for the extension too:

[CODE]theImage = sc2.findWithinHorizon("www\\.explosm\\.net/db/files/Comics/.*?\\.(?i)(jpe?g|gif|png)", 0);[/CODE]

I'll try that!

I don?t have the ability to compile your code right now, so I can?t give you a straight answer. But, I can help you to debug your code yourself!

What?s being stored in theImage after findWithinHorizon is run? What was the expected value?

This is what's being stored in theImage

http://pastebin.com/PirKMMXU

This is what I want

http://www.explosm.n...s/Kris/well.png

Why would it grab so much extra? Is it because that regex appears in that line again?

  • 0

By default quantifiers in regex are greedy, adding ? makes them lazy.

For example: if you had the string abcdabc using a pattern of .*b would match up until the last b character: abcdab but using .*?b would give just ab

In your case it was matching the beginning of the URL, and then trying to find the last match for jpg, jpeg, gif or png. Adding the ? makes it stop on the first match.

There's a better explanation here: http://www.regular-expressions.info/repeat.html

  • 0

By default quantifiers in regex are greedy, adding ? makes them lazy.

For example: if you had the string abcdabc using a pattern of .*b would match up until the last b character: abcdab but using .*?b would give just ab

In your case it was matching the beginning of the URL, and then trying to find the last match for jpg, jpeg, gif or png. Adding the ? makes it stop on the first match.

Oh that makes sense!

Ok so at page 2717 the link does NOT contain www. So I changed my reg expression

  • 0

EDIT;

I think this fixed it

finally{

//System.out.println("hi");

continue;

}

Ok so I have another question. Sometimes the pages numbers dont exist which causes an error. I try to catch the error, then move along. The catch block catches it but loops the catch. I added a scanner.next(); to capture the bad input. That will stop the catch from looping (or I guess the program from looping with bad input) but then it just stops in the catch and doesn't continue past the scanner.next(). What can I do? I setup the next link in the catch block so I need the the program to leave the catch block and start over at the top of the while loop.

http://pastebin.com/Kpajd6ER

  • 0

I have to ask... where did you learn to put square brackets around almost every character in your regex pattern? You don't need to do that, it's just making it harder to read. Also, it doesn't matter much in your pattern, but you should be escaping the . in "explosm.net", the full-stop has a special meaning, match any character.

Good job overall though!

  • 0

I have to ask... where did you learn to put square brackets around almost every character in your regex pattern? You don't need to do that, it's just making it harder to read. Also, it doesn't matter much in your pattern, but you should be escaping the . in "explosm.net", the full-stop has a special meaning, match any 1 character.

Good job overall though!

Are you referring to the http://www part? I did the www because not all the links have that. And I just did that to the http when I was trying to debug. i just let it like that. If you are referring to my file extensions, I think it's needed.

  • 0

Are you referring to the http://www part? I did the www because not all the links have that. And I just did that to the http when I was trying to debug. i just let it like that. If you are referring to my file extensions, I think it's needed.

I meant the http part. Square brackets only have meaning if you use them like [ab] which matches a or b, but you have it around single characters: [h][t][t][p][:][/][/]. It's not a problem, I was just interested because I've never seen anyone do that before.

  • 0

I meant the http part. Square brackets only have meaning if you use them like [ab] which matches a or b, but you have it around single characters: [h][t][t][p][:][/][/]. It's not a problem, I was just interested because I've never seen anyone do that before.

Just a debugging thing but left it haha

This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • I agree when are you going to read this (really poor BTW) article?
    • I disagree; they come off very "bitchy" and "whiny". Make a great product and combine that with a great price (free) and people will come over to your side. Or build it and they will come as they say. Constantly trying to get attention by complaining all the time, will turn people off to your product.
    • It use to be a nightmare, with LibreOffice supporting a newer draft ODF standard by default, and Microsoft Office supporting the older non-draft standard. Now that they both support the same version of ODF, they should be interoperable.
    • Brave Browser 1.91.171 by Razvan Serea Brave Browser is a lightning-fast, secure web browser that stands out from the competition with its focus on privacy, security, and speed. With features like HTTPS Everywhere and built-in tracker blocking, Brave keeps your online activities safe from prying eyes. Brave is one of the safest browsers on the market today. It blocks third-party data storage. It protects from browser fingerprinting. And it does all this by default. Speed - Brave is built on Chromium, the same technology that powers Google Chrome, and is optimized for speed, providing a fast and responsive browsing experience. Brave Browser also features Brave Rewards, a system that rewards users with Basic Attention Tokens (BAT) for viewing opt-in ads. This innovative system provides an alternative revenue model for content creators and a way to support the Brave community. SlimBrave Neo takes all the good things about Brave and makes them even better by keeping everything clean, light, and privacy-focused. It removes the extra clutter, turns off features you might not need, and cuts down on anything that could slow you down or collect unnecessary data. Because it relies on simple settings and policies instead of modifying the browser itself, you still get full Brave compatibility—just in a smoother, lighter, and more privacy-friendly package. Brave Browser 1.91.171 changelog: General Fixed Cardano not being disabled on upgrade to Brave Origin. Upgraded Chromium to 149.0.7827.103. Origin Removed “Survey Panelist” setting from brave://settings/privacy. Fixed P3A and usage ping under brave://settings/privacy being displayed on first launch on Linux. Upgraded Chromium to 149.0.7827.103. Download: Brave Browser 64-bit | 1.2 MB (Freeware) Download: Brave Browser 32-bit View: Brave Homepage | Offline Installers | Screenshot Get alerted to all of our Software updates on Twitter at @NeowinSoftware
    • Hi. As the title suggests, I can't access the forum on my phone. I'm using Edge on Android and when I try to navigate to the forum I get a "we value your privacy" popup and none of the buttons are clickable. It effectively stonewalls me from reading any forum content.
  • Recent Achievements

    • Rookie
      Marzoid went up a rank
      Rookie
    • Community Regular
      coch went up a rank
      Community Regular
    • One Year In
      slackerzz earned a badge
      One Year In
    • One Year In
      highriskpaym earned a badge
      One Year In
    • One Month Later
      highriskpaym earned a badge
      One Month Later
  • Popular Contributors

    1. 1
      +primortal
      519
    2. 2
      PsYcHoKiLLa
      190
    3. 3
      +Edouard
      156
    4. 4
      Steven P.
      84
    5. 5
      ATLien_0
      75
  • Tell a friend

    Love Neowin? Tell a friend!