
microdata?

java

11 replies to this topic

#1 Original Poster

Original Poster

    C++ n00b

  • Tech Issues Solved: 1
  • Joined: 15-July 08
  • Location: my room
  • OS: windows 7, backtrack 5, OSx 10.6

Posted 21 December 2013 - 15:43

Hello,

I am trying to mine microdata from websites based on the schema.org vocabulary, but I am having trouble getting everything I want in a simple way.

As you can see below, I am doing this in a very roundabout way, but when I try to google it I get loads of results that either aren't helpful or are too much for what I require. I simply want to grab the microdata, and that's it for now. Any help?

example:

// Reads the HTML page line by line ("in" is presumably a BufferedReader over the page).
while ((inputLine = in.readLine()) != null)
{
	String inputLine2 = inputLine.replaceAll("\"", ""); // removes quotation marks from the line
	String[] parts = inputLine2.split("<span itemprop="); // each later part starts with whatever followed <span itemprop=
	for (int loop = 0; loop < parts.length; loop++)
	{
		System.out.println(parts[loop]);
		if (parts[loop].contains("servesCuisine>"))
		{
			System.out.println("ERROR CHECK" + parts[loop]);
			serves = parts[loop].substring(14); // drops the first 14 characters, the length of "servesCuisine>"
		}
		if (parts[loop].contains("cell1>"))
		{
			rating = parts[loop].substring(0, parts[loop].length()); // copies the whole part unchanged
		}
	}
}



#2 +Zlip792

Zlip792

    Neowinian Senior

  • Tech Issues Solved: 9
  • Joined: 31-October 10
  • Location: Pakistan
  • OS: Windows 8.1 Pro 64-bit
  • Phone: Nokia C3-00 (8.70 firmware) It sucks!!!

Posted 21 December 2013 - 15:54

Microdata to RDF output: http://www.w3.org/TR/microdata-rdf/

Microdata to RDF: http://www.w3.org/2012/pyMicrodata/

Google Tool: http://www.google.co...ls/richsnippets

HTML5 Doctor on microdata: http://html5doctor.com/microdata/

 

Another HTML5 Microdata extractor: http://microdata-ext...improbable.org/ ; http://getschema.org...aextractor/test



#3 OP Original Poster

Posted 22 December 2013 - 00:33

Thanks, but these did not really touch on my problem. I know about the tools and about microdata; I am trying to write an application in Java to extract it.



#4 +snaphat (Myles Landwehr)

snaphat (Myles Landwehr)

    Electrical & Computer Engineer

  • Tech Issues Solved: 29
  • Joined: 23-August 05
  • OS: Win/Lin/Bsd/Osx
  • Phone: dumb phone

Posted 22 December 2013 - 05:13

You should probably try a regex-based approach.

 

Also, why are you reading & processing line by line? Is there a guarantee that tags are found on certain lines after you split by itemprop?
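
A minimal sketch of what a regex-based approach over the whole document (rather than line by line) might look like. The class name `ItempropRegex` and the sample markup are made up for illustration. The pattern only handles double-quoted `itemprop` values on non-nested `<span>` elements, so treat it as a starting point under those assumptions, not as a general microdata extractor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItempropRegex {
    // Matches <span ... itemprop="NAME" ...>TEXT</span>.
    // DOTALL lets "." cross line breaks, so the tag and its text may span several lines.
    private static final Pattern SPAN = Pattern.compile(
        "<span\\b[^>]*?\\bitemprop\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</span>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns property name -> text content, in document order.
    static Map<String, String> extract(String html) {
        Map<String, String> props = new LinkedHashMap<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            props.put(m.group(1).trim(), m.group(2).trim());
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from a real site.
        String html = "<div itemscope itemtype=\"http://schema.org/Restaurant\">\n"
            + "  <span itemprop=\"name\">Example Diner</span>\n"
            + "  <span itemprop=\"servesCuisine\">\n"
            + "    Italian\n"
            + "  </span>\n"
            + "</div>";
        System.out.println(extract(html)); // prints {name=Example Diner, servesCuisine=Italian}
    }
}
```

Reading the whole page into one string first is what removes the dependence on where the line breaks happen to fall.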



#5 OP Original Poster

Posted 22 December 2013 - 23:39

The program reads each line until it finds itemprop and splits at that point; if it doesn't find it, it does nothing. This works on random sites that contain these attributes :)



#6 +snaphat (Myles Landwehr)

Posted 23 December 2013 - 00:58

My point was that it looks like it only searches for servesCuisine and cell1 on the same line that itemprop was found on, and I was wondering why that would be a valid assumption, i.e. why isn't it possible for those tags to start on another line after itemprop is found?
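
To make the concern concrete, here is a small sketch that mirrors the per-line logic from the first post for a single property and runs it on two inputs. The class name `SameLineAssumption` and both sample strings are invented for the demonstration; the split and substring steps follow the posted code.

```java
class SameLineAssumption {
    // Mirrors the per-line logic from the first post, for servesCuisine only.
    static String servesPerLine(String html) {
        String serves = null;
        for (String line : html.split("\n")) {
            String[] parts = line.replace("\"", "").split("<span itemprop=");
            for (String part : parts) {
                if (part.contains("servesCuisine>")) {
                    serves = part.substring(14); // skip "servesCuisine>"
                }
            }
        }
        return serves;
    }

    public static void main(String[] args) {
        String oneLine = "<span itemprop=\"servesCuisine\">Italian</span>";
        String multiLine = "<span itemprop=\"servesCuisine\">\nItalian\n</span>";
        System.out.println("[" + servesPerLine(oneLine) + "]");   // prints [Italian</span>]
        System.out.println("[" + servesPerLine(multiLine) + "]"); // prints []
    }
}
```

When the value sits on the line after the opening tag, the captured string is empty, and even in the single-line case the closing tag is still attached to the value.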



#7 The_Decryptor

The_Decryptor

    STEAL THE DECLARATION OF INDEPENDENCE

  • Tech Issues Solved: 4
  • Joined: 28-September 02
  • Location: Sol System
  • OS: iSymbian 9.2 SP24.8 Mars Bar

Posted 23 December 2013 - 01:07

For starters, you'll probably want to use an HTML parser; trying to parse HTML like that is just going to break.

Then, once you've got the HTML parsed into a tree, you can just iterate through nodes and pull off the nicely formatted attributes and values.

Edit: For example, this HTML fragment is perfectly valid, but would break your code.
 
<span itemprop="
test
test1>
test4">text
</span>
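
A rough sketch of the parser-based idea, using the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) so that it runs without extra dependencies. That parser is dated, and a dedicated HTML5 parser library would likely be a better fit in practice. The class name `ItempropParser`, the sample markup, and the rule "the first non-blank text after a tag carrying `itemprop` is its value" are all simplifying assumptions made for this example. It uses the parser's callbacks directly instead of building a full tree.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ItempropParser {
    // Returns itemprop name -> first non-blank text that follows the tag.
    static Map<String, String> extract(String html) {
        final Map<String, String> props = new LinkedHashMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String pending; // itemprop name still waiting for its text

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                remember(attrs);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty or unrecognised elements arrive here; an unrecognised
                // end tag is flagged with the ENDTAG attribute, so skip those.
                if (!attrs.isDefined(HTML.Attribute.ENDTAG)) {
                    remember(attrs);
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (pending != null && !text.isEmpty()) {
                    props.put(pending, text);
                    pending = null;
                }
            }

            // Looks for an attribute named "itemprop" by its printed name.
            private void remember(MutableAttributeSet attrs) {
                Enumeration<?> names = attrs.getAttributeNames();
                while (names.hasMoreElements()) {
                    Object name = names.nextElement();
                    if ("itemprop".equals(name.toString())) {
                        pending = attrs.getAttribute(name).toString();
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup with the value on its own line.
        String html = "<div itemscope itemtype=\"http://schema.org/Restaurant\">\n"
            + "<span itemprop=\"name\">Example Diner</span>\n"
            + "<span itemprop=\"servesCuisine\">\nItalian\n</span>\n"
            + "</div>";
        System.out.println(extract(html));
    }
}
```

Properties whose value lives in an attribute (for example a `content` attribute) are not handled by this sketch and would need an extra case.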


#8 OP Original Poster

Posted 23 December 2013 - 06:39

I appreciate the help and shall keep this in mind.



#9 Lant

Lant

    Neowinian Senior

  • Joined: 13-April 06

Posted 23 December 2013 - 20:00

I believe Stack Overflow states this better than I can, but regex should not be used to parse HTML.



#10 +snaphat (Myles Landwehr)

Posted 23 December 2013 - 21:06

That link is essentially saying that you can't build a parser for HTML as a language using regex (the answer marked correct actually does a very poor job of explaining that). The reason is that HTML has a context-free grammar, and regular expressions are strictly not powerful enough to recognize that class of languages *. Strictly speaking, you would need some form of LL/LR parser for HTML (see the wiki for these).

 

That being said, the OP is not trying to parse the language itself as a whole and so there is a distinct difference. He's just trying to find a few tidbits of information within a document and regex could very well be appropriate for that job. It really depends on the nature of what he wants to parse. In general, speaking from an engineering perspective, you should use the simplest possible solution. He shouldn't build an HTML parser if he doesn't have to. But, as The_Decryptor said, using an already existing HTML Parser is a good (better) alternative (and probably less work for the OP).

 

* Note: there are languages that can be generated and parsed by regex, though (the regular languages).
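
The limitation described above shows up as soon as tags nest. The sketch below is only an illustration with invented names (`NestingDemo`, `regexContent`, `balancedContent`) and a toy input: a plain pattern stops at the first closing tag, while a version that counts nesting depth finds the matching one. The counter plays the role of the stack that a regular expression lacks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestingDemo {
    // Regex attempt: content of the first <span>, up to the FIRST </span>.
    static String regexContent(String html) {
        Matcher m = Pattern.compile("<span>(.*?)</span>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Counter-based attempt: track nesting depth to find the MATCHING </span>.
    static String balancedContent(String html) {
        Matcher m = Pattern.compile("</?span>").matcher(html);
        int depth = 0;
        int start = -1;
        while (m.find()) {
            if (m.group().equals("<span>")) {
                if (depth == 0) {
                    start = m.end(); // content begins after the outermost opening tag
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    return html.substring(start, m.start());
                }
            }
        }
        return null; // no balanced outer <span> found
    }

    public static void main(String[] args) {
        String html = "<span>a<span>b</span>c</span>";
        System.out.println(regexContent(html));    // prints a<span>b
        System.out.println(balancedContent(html)); // prints a<span>b</span>c
    }
}
```

For flat, non-nested snippets the two functions agree, which is why a regex can still be adequate for pulling a few known values out of a page.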



#11 The_Decryptor

Posted 24 December 2013 - 05:21

Yeah, you don't want something overly complex, but at the same time the one constant about HTML is that authors screw it up; that's why its parsing rules are so arcane.

If you're dealing with random content from the wild, I wouldn't bother trying to write my own parser. No matter what, somebody will come up with a way to break it (like including HTML unescaped in XML, or using invalid encodings so that you get presented with random bytes, missing tags, backwards declarations, nested tags that can't be nested, etc.). And it's easier to use a parser that works with their content than to try to get them to fix their content ("But it works for us, it's your code that's broken", etc.).

Edit: I actually just remembered something related to this I saw recently, somebody saved a document with "smart quotes" as UTF-8, then loaded that as Latin-1, then converted it to Windows-1252, then saved that as UTF-8, and complained when the browser didn't render them properly.

#12 +snaphat (Myles Landwehr)

Posted 24 December 2013 - 05:25

These are great points!




