• 0

Java Question (Should be easy)


Question

Could someone look at my code and tell me why findWithinHorizon is returning more than my regular expression is looking for? I cannot figure it out. If I take the file extension part out of the pattern, I think it will work, but with it in place the match takes in the whole line, extra HTML included. Enter 2776 when the program starts. The program downloads an HTML page and searches it for my regular expression, but it isn't working! This code is based on another program I made that DID work, but it stopped working when I changed it for C&H :/

Source code-

http://pastebin.com/cxTv30yW

THANKS!

https://www.neowin.net/forum/topic/1072709-java-question-should-be-easy/

12 answers to this question


  • 0

I haven't tested it, but I suspect it's because you're using a greedy match in your regex. Change it to a non-greedy match by adding ? after .*

[CODE]theImage = sc2.findWithinHorizon("[h][t][t][p][:][/][/]www.explosm.net/db/files/Comics/.*?[.][jJpPgG][pPnNiI][eE]?[gGfF]", 0);[/CODE]

I've never used Java, so I'm unfamiliar with its regex matcher, but it looks like (?i) can make part of a pattern case-insensitive, so you could write the extension in a more readable way too:

[CODE]theImage = sc2.findWithinHorizon("www\\.explosm\\.net/db/files/Comics/.*?\\.(?i)(jpe?g|gif|png)", 0);[/CODE]
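For anyone who wants to try the suggested pattern without downloading anything, here is a minimal sketch. The HTML line and the class and method names are made up for illustration; only the pattern itself comes from the post above. It assumes a horizon of 0, which tells Scanner to search all remaining input.

```java
import java.util.Scanner;

class HorizonDemo {
    // Made-up sample line: two image URLs on the same line of HTML.
    static final String HTML =
        "<img src=\"http://www.explosm.net/db/files/Comics/Kris/first.png\"> "
        + "<img src=\"http://www.explosm.net/db/files/Comics/Rob/second.GIF\">";

    // The pattern suggested above: lazy .*? plus a case-insensitive extension group.
    static final String PATTERN =
        "www\\.explosm\\.net/db/files/Comics/.*?\\.(?i)(jpe?g|gif|png)";

    // Returns the first match in the given text, or null if there is none.
    static String firstImage(String html) {
        try (Scanner sc = new Scanner(html)) {
            // A horizon of 0 means "no bound": search everything that is left.
            return sc.findWithinHorizon(PATTERN, 0);
        }
    }

    public static void main(String[] args) {
        // prints www.explosm.net/db/files/Comics/Kris/first.png
        System.out.println(firstImage(HTML));
    }
}
```

The match starts at www because this version of the pattern leaves out the http:// prefix.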

  • 0

EDIT****

.*?

I think this worked. I'll post back in a bit

I'll try that!

I don't have the ability to compile your code right now, so I can't give you a straight answer. But I can help you debug your code yourself!

What's being stored in theImage after findWithinHorizon is run? What was the expected value?

This is what's being stored in theImage

http://pastebin.com/PirKMMXU

This is what I want

http://www.explosm.n...s/Kris/well.png

Why would it grab so much extra? Is it because that regex appears in that line again?

  • 0

By default, quantifiers in regex are greedy; adding ? makes them lazy.

For example, given the string abcdabc, the pattern .*b matches up to the last b character: abcdab. The pattern .*?b gives just ab.

In your case it was matching the beginning of the URL and then running on to the last occurrence of jpg, jpeg, gif or png. Adding the ? makes it stop on the first match.

There's a better explanation here: http://www.regular-expressions.info/repeat.html
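The abcdabc example above can be checked directly in Java. This is just a small sketch; the class and helper names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(".*b", "abcdabc"));  // greedy: prints abcdab
        System.out.println(firstMatch(".*?b", "abcdabc")); // lazy: prints ab
    }
}
```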

  • 0

Oh that makes sense!

OK, so on page 2717 the link does NOT contain www, so I changed my regular expression.
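The changed expression isn't shown in the thread, so the following is only a guess at one way to do it: wrap the www. part in an optional group. The sample links and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalWww {
    // (?:www\.)? makes the "www." prefix optional; the rest follows the
    // pattern discussed earlier in the thread.
    static final Pattern IMAGE = Pattern.compile(
        "http://(?:www\\.)?explosm\\.net/db/files/Comics/.*?\\.(?i)(?:jpe?g|gif|png)");

    // Returns the first image URL found in the text, or null if there is none.
    static String find(String html) {
        Matcher m = IMAGE.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up links: one with www, one without.
        System.out.println(find("<img src=\"http://www.explosm.net/db/files/Comics/Kris/b.PNG\">"));
        System.out.println(find("<img src=\"http://explosm.net/db/files/Comics/Dave/a.jpg\">"));
    }
}
```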

  • 0

EDIT:

I think this fixed it

[CODE]finally {
    //System.out.println("hi");
    continue;
}[/CODE]

OK, so I have another question. Sometimes a page number doesn't exist, which causes an error. I try to catch the error and then move along. The catch block catches it, but then the catch keeps looping. I added a scanner.next(); to consume the bad input. That stops the catch from looping (or, I guess, stops the program from looping on bad input), but then the program just stops in the catch and doesn't continue past the scanner.next(). What can I do? I set up the next link in the catch block, so I need the program to leave the catch block and start over at the top of the while loop.

http://pastebin.com/Kpajd6ER
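Without the pastebin code in front of us, here is only a rough sketch of one common shape for this kind of loop: catch the failure, report it, and advance the page counter outside the try so that a bad page can never be retried forever. The fetchPage method is a hypothetical stand-in for the real download and simply pretends that odd-numbered pages are missing.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SkipMissingPages {
    // Hypothetical stand-in for the real download: odd-numbered pages "do not exist".
    static String fetchPage(int page) throws IOException {
        if (page % 2 != 0) {
            throw new FileNotFoundException("no page " + page);
        }
        return "<html>comic " + page + "</html>";
    }

    // Tries every page in [first, last] and returns the page numbers that downloaded.
    static List<Integer> download(int first, int last) {
        List<Integer> ok = new ArrayList<>();
        int page = first;
        while (page <= last) {
            try {
                fetchPage(page); // the real program would search the HTML here
                ok.add(page);
            } catch (IOException e) {
                System.out.println("Skipping page " + page + ": " + e.getMessage());
            }
            page++; // advance outside the try so a failure cannot repeat the same page
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(download(2714, 2718)); // prints [2714, 2716, 2718]
    }
}
```

One thing worth knowing about the finally approach: a continue inside finally makes Java discard any exception that was still propagating, so unexpected errors can vanish silently. Letting the loop fall through after the catch, as in this sketch, avoids that.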

  • 0

I have to ask... where did you learn to put square brackets around almost every character in your regex pattern? You don't need to do that; it just makes the pattern harder to read. Also, it doesn't matter much in your pattern, but you should escape the . in "explosm.net": the full stop has a special meaning in regex, matching any single character.

Good job overall though!
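A quick way to see the difference the escape makes (a minimal sketch; the class and method names and the test strings are made up for illustration):

```java
import java.util.regex.Pattern;

class DotEscape {
    // Unescaped: "." matches any single character.
    static boolean loose(String s) {
        return Pattern.matches("explosm.net", s);
    }

    // Escaped: "\\." matches only a literal full stop.
    static boolean strict(String s) {
        return Pattern.matches("explosm\\.net", s);
    }

    public static void main(String[] args) {
        System.out.println(loose("explosmXnet"));  // prints true
        System.out.println(strict("explosmXnet")); // prints false
        System.out.println(strict("explosm.net")); // prints true
    }
}
```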

  • 0

Are you referring to the http://www part? I did that to the www because not all the links have it, and I did it to the http while I was trying to debug and just left it like that. If you are referring to my file extensions, I think the brackets are needed there.

  • 0

I meant the http part. Square brackets only have meaning when you use them like [ab], which matches a or b, but you have them around single characters: [h][t][t][p][:][/][/]. It's not a problem; I was just interested because I've never seen anyone do that before.
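To illustrate, a small sketch (class and method names made up) showing that the bracketed form and the plain form accept the same input, while a class like [ab] really does offer a choice:

```java
class SingleCharClass {
    static final String BRACKETED = "[h][t][t][p][:][/][/]";
    static final String PLAIN = "http://";

    // True when both patterns agree on whether the whole input matches.
    static boolean agree(String input) {
        return input.matches(BRACKETED) == input.matches(PLAIN);
    }

    public static void main(String[] args) {
        System.out.println("http://".matches(BRACKETED)); // prints true
        System.out.println("http://".matches(PLAIN));     // prints true
        System.out.println("a".matches("[ab]"));          // prints true
        System.out.println("c".matches("[ab]"));          // prints false
    }
}
```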

  • 0

Just a debugging thing but left it haha

This topic is now closed to further replies.