microdata?


Hello,

I am trying to mine microdata from websites based on the schema.org vocabulary, but I am having trouble getting everything I want in a simple way.

As you can see below, I am doing this in a very roundabout way, but when I google it I get loads of results that either aren't helpful or are far more than I need. I simply want to grab the microdata and that's it for now. Any help?

Example:

while ((inputLine = in.readLine()) != null) // reads the HTML page line by line
{
	// strip double quotes so that itemprop="x" becomes itemprop=x
	String inputLine2 = inputLine.replaceAll("\"", "");
	// split the line at every occurrence of <span itemprop=
	String[] parts = inputLine2.split("<span itemprop=");
	for (int loop = 0; loop < parts.length; loop++)
	{
		System.out.println(parts[loop]);
		if (parts[loop].contains("servesCuisine>"))
		{
			System.out.println("ERROR CHECK" + parts[loop]);
			// skip the first 14 characters (the length of "servesCuisine>")
			serves = parts[loop].substring(14);
		}
		if (parts[loop].contains("cell1>"))
		{
			// keeps the whole part, including the "cell1>" prefix
			rating = parts[loop].substring(0, parts[loop].length());
		}
	}
}

11 answers to this question

  • 0

Thanks, but these did not really touch on my problem. I know about the tools and about microdata; I am trying to write an application in Java to extract it.

  • 0

You should probably try a regex-based approach.

Also, why are you reading and processing line by line? Is there a guarantee that the tags you want are found on the same line once you split by itemprop?

  • 0

The program reads each line until it finds itemprop and splits at that point; if it doesn't find it, it does nothing. This works on random sites that contain these properties :)

  • 0

My point was that it looks like it only searches for servesCuisine and cell1 on the same line that itemprop was found on, and I was wondering why that would be a valid assumption, i.e. why isn't it possible for those tags to start on another line after itemprop is found?
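To make the same-line concern concrete, here is a minimal sketch that applies the split logic from the question first to each line and then to the whole page. The class name, method name, and sample markup are made up for illustration; only the split and substring steps come from the original code.

```java
import java.util.ArrayList;
import java.util.List;

class LineSplitSketch {
    // Applies the split-and-substring logic from the question to one chunk of text.
    static List<String> servesValues(String chunk) {
        List<String> found = new ArrayList<>();
        String unquoted = chunk.replaceAll("\"", "");
        for (String part : unquoted.split("<span itemprop=")) {
            if (part.contains("servesCuisine>")) {
                // Like the original, this assumes the part starts with "servesCuisine>".
                found.add(part.substring("servesCuisine>".length()));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical page where the value sits on the line after the opening tag.
        String page = "<span itemprop=\"servesCuisine\">\nItalian</span>";
        for (String line : page.split("\n")) {
            System.out.println("per line:   " + servesValues(line));
        }
        System.out.println("whole page: " + servesValues(page));
    }
}
```

Fed line by line, the first line yields an empty value and the second yields nothing, because "Italian" is on a different line from the tag. Fed the whole page at once, the value is at least captured, although it still carries the trailing markup.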

  • 0

For starters, you'll probably want to use an HTML parser; trying to parse HTML like that is just going to break.

Then, once you've got the HTML parsed into a tree, you can just iterate through the nodes and pull off the nicely formatted attributes and values.

Edit: For example, this HTML fragment is perfectly valid, but would break your code:

<span itemprop="
test
test1>
test4">text
</span>
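
A rough sketch of the parse-then-walk idea, using only the XML DOM parser that ships with the JDK. It assumes the markup is well-formed XML, which real pages often are not; for content from the wild, a lenient HTML parser library (jsoup is one commonly used option for Java) would take the place of the XML parser. The class name, method name, and sample markup are illustrative, not from the thread.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ItempropTreeSketch {
    // Parses well-formed markup into a DOM tree and maps each itemprop to its text content.
    static Map<String, String> itemprops(String markup) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        Map<String, String> props = new LinkedHashMap<>();
        NodeList elements = doc.getElementsByTagName("*"); // every element, in document order
        for (int i = 0; i < elements.getLength(); i++) {
            Element el = (Element) elements.item(i);
            if (el.hasAttribute("itemprop")) {
                props.put(el.getAttribute("itemprop").trim(), el.getTextContent().trim());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Includes the multi-line fragment from the post above.
        String markup = "<div>"
                + "<span itemprop=\"servesCuisine\">\nItalian</span>"
                + "<span itemprop=\"\ntest\ntest1>\ntest4\">text</span>"
                + "</div>";
        itemprops(markup).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Because the parser handles quoting and line breaks, the attribute that spans several lines and contains a `>` comes back as a single value, and the walk never has to care where the line breaks were. Nested items and non-text properties (such as links or images) are not handled in this sketch.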

  • 0

I appreciate the help; I shall keep this in mind.

  • 0

In reply to the suggestion to try a regex-based approach: I believe Stack Overflow states this better than I can, but regex should not be used to parse HTML.

  • 0

That link is essentially saying that you can't build a parser for HTML as a language using regex (the answer marked correct actually does a very poor job of explaining that). The reason is that HTML has a context-free grammar, and regex is strictly not a powerful enough construct for that *. Strictly speaking, you would need some form of LL/LR parser for HTML (see the wiki for these).

That being said, the OP is not trying to parse the language as a whole, so there is a distinct difference. He's just trying to find a few tidbits of information within a document, and regex could very well be appropriate for that job. It really depends on the nature of what he wants to extract. In general, speaking from an engineering perspective, you should use the simplest possible solution. He shouldn't build an HTML parser if he doesn't have to. But, as The_Decryptor said, using an existing HTML parser is a good (better) alternative, and probably less work for the OP.

* Note: there are languages that can be generated and parsed by regex, though (the regular languages).
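
For the "few tidbits" case, a regex sketch might look like the following. It assumes the page has been read into one string and that each property is a span whose only attribute is a double-quoted itemprop; the class name, the pattern, and the ratingValue sample are illustrative choices, not something from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItempropRegexSketch {
    // DOTALL lets '.' cross line breaks, so the value may sit on a later line than the tag.
    private static final Pattern SPAN_ITEMPROP = Pattern.compile(
            "<span\\s+itemprop\\s*=\\s*\"([^\"]*)\"\\s*>(.*?)</span>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Maps each itemprop name to the raw text up to the first closing </span>.
    static Map<String, String> find(String html) {
        Map<String, String> props = new LinkedHashMap<>();
        Matcher m = SPAN_ITEMPROP.matcher(html);
        while (m.find()) {
            props.put(m.group(1).trim(), m.group(2).trim());
        }
        return props;
    }

    public static void main(String[] args) {
        String flat = "<span itemprop=\"servesCuisine\">\nItalian</span> "
                + "<span itemprop=\"ratingValue\">4.5</span>";
        System.out.println(find(flat));

        // A nested span shows the limit: the match stops at the first </span>.
        String nested = "<span itemprop=\"a\"><span>x</span>y</span>";
        System.out.println(find(nested));
    }
}
```

On flat markup this picks up both values even across a line break. On the nested input the captured value is cut off at the inner closing tag, which is the kind of case where a regular expression runs out of power and a real parser is the safer choice.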

  • 0

Yeah, you don't want something overly complex, but at the same time the one constant about HTML is that authors screw it up; that's why the parsing rules for it are so arcane.

If you're dealing with random content from the wild, I wouldn't bother trying to write my own parser. No matter what, somebody will come up with a way to break it (like including HTML unescaped in XML, or using invalid encodings so that you get presented with random bytes, missing tags, backwards declarations, nested tags that can't be nested, etc.). And it's easier to use a parser that works with their content than to try to get them to fix their content ("But it works for us, it's your code that's broken", etc.).

Edit: I actually just remembered something related to this that I saw recently: somebody saved a document with "smart quotes" as UTF-8, then loaded that as Latin-1, then converted it to Windows-1252, then saved that as UTF-8, and complained when the browser didn't render them properly.
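
A small sketch of how that kind of damage starts. It reproduces only a simplified version of the anecdote (UTF-8 bytes decoded as Latin-1, then saved as UTF-8 again) and leaves out the Windows-1252 step; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    // Encodes text as UTF-8, then decodes those bytes with the wrong charset (Latin-1).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // left "smart" double quote
        String broken = misdecode(quote);

        System.out.println("original chars: " + quote.length());
        System.out.println("broken chars:   " + broken.length());
        // Saving the broken text as UTF-8 again bakes the damage in.
        System.out.println("bytes before:   " + quote.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("bytes after:    " + broken.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

One quote character becomes three unrelated characters after the wrong decode, and saving those as UTF-8 doubles the byte count, so the browser is faithfully rendering what is now garbage.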

  • 0

These are great points!

This topic is now closed to further replies.