Google adds OCR for PDFs and images

Google has added a new feature to its Google Docs platform: OCR for PDF and image files. OCR, or optical character recognition, reads the text embedded in a PDF or image so that you can edit the document in a normal text editor.

OCR is not completely accurate, though. Even if the text in the image was originally generated by a computer, there can be artifacts or letters that seem to run together when the OCR engine scans the document, which can lead to errors in the conversion. Most of the time, though, the majority of the text is converted accurately, so the amount of text you have to fix is minimal, making the service a pretty good time saver.

When uploading documents to your Google Docs space you will now be asked if you would like to 'Convert text from PDF or image files to Google Docs documents.' The Unofficial news and tips about Google blog tested the new service (pictured below) and found that about 10% of the text was incorrectly converted and the formatting was not preserved.

Google docs PDF conversion
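For readers who want to put a number on conversion quality themselves, here is a minimal sketch of one way to estimate character-level accuracy by comparing OCR output against a known-correct transcript. It is not how Google or the blog measured the roughly 10% error figure; the function name `char_accuracy` and the sample strings are hypothetical, and the result comes from Python's standard `difflib` matcher, which gives an approximation rather than a formal error rate.

```python
from difflib import SequenceMatcher


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Return the fraction of reference characters matched in the OCR output."""
    if not reference:
        # An empty reference is only "fully matched" by empty output.
        return 1.0 if not ocr_output else 0.0
    # autojunk=False keeps the matcher from ignoring frequent characters
    # in longer texts.
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


# Hypothetical example: the letter "m" misread as "rn", a classic OCR slip
# when letters appear to run together.
reference = "Convert text from PDF or image files"
ocr_output = "Convert text frorn PDF or irnage files"

print(f"Character accuracy: {char_accuracy(reference, ocr_output):.1%}")
```

Running the same comparison on a page you have typed out by hand gives a rough sense of how much cleanup a converted document will need.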


19 Comments


I'm not sure what Evernote uses, but it has some sort of OCR that seems near 100% for uploaded pictures. I'm sort of surprised Google hasn't bought them yet; it seems like they would fit well.

Crytel said,
I'm not sure what Evernote uses, but it has some sort of OCR that seems near 100% for uploaded pictures. I'm sort of surprised Google hasn't bought them yet; it seems like they would fit well.

They do have OCR, I didn't know that. They don't call it OCR and you have to read their features carefully to notice. They ought to emphasize that a little more; that is a huge plus to an already great service.

And yeah we can always expect Google to improve upon this service, now that they have started it....important thing was the START !!

90% IS pretty damn good considering you will need one less software on ur system for the task and one less headache......And That's a first kinda initiative in the industry too...so a big heads up to GOOGLE !!!!

flash_flicker said,
90% IS pretty damn good considering you will need one less software on ur system for the task and one less headache......And That's a first kinda initiative in the industry too...so a big heads up to GOOGLE !!!!

Uh, if there's free software (which there is) that does OCR better than Google (no facts on that yet), then why not install the possibly better software? I doubt you've tried all the OCR software that exists, so you can't say which is better either. I would get a headache from seeing only 90% accuracy.

Oh and "initiative", huh?? OCR's been around for over 3 years I believe. First time I used it was 2 years ago. Google's showing initiative to catch up maybe, and they're doing it badly too.

andrewbares said,

Uh, if there's free software (which there is) that does OCR better than Google (no facts on that yet), then why not install the possibly better software? I doubt you've tried all the OCR software that exists, so you can't say which is better either. I would get a headache from seeing only 90% accuracy.

Oh and "initiative", huh?? OCR's been around for over 3 years I believe. First time I used it was 2 years ago. Google's showing initiative to catch up maybe, and they're doing it badly too.

hehe i never said that Google is doing better than the industry stalwarts or something and OCR has been there for more than 8-10 years bro.....I was just being appreciative of the fact that they have actually gone ahead and integrated it into their system (and that's the FIRST)..No wonder its accuracy is only 90% or even lower....You don't use it for the professional reasons in any case...in that case it's wise to go for professional softwares (or good freeware alternatives) which need to be installed first !! Peace...

If the document is the same resolution as it is in this article, then I would say 90% is pretty damn good. That image is terrible.

AFAIK there is no free software for OCR, and the best ones are quite expensive. So maybe it is not a good alternative for people who already own an OCR program, but it is fine for everyone else.

Magallanes said,
AFAIK there is no free software for OCR, and the best ones are quite expensive. So maybe it is not a good alternative for people who already own an OCR program, but it is fine for everyone else.

There ARE free OCR programs. Go to Download.com and do a bit of searching. I've never paid a cent for scanning software, but I got legal, free OCR software that I use. Hear that, FREE.

I don't expect it to keep formatting but only 90% success? I thought we had tech that would easily put that in the 99% range unless the scanned document is a FAX from 1995.

I wonder why it fails so hard on that example text. An OCR engine in 2010 shouldn't struggle with a typical serif font without any interference like that. *shrug*

Commenting on myself now, but can't edit... I just thought that maybe that text used a very low resolution. That's the only explanation I can come up with. Either that, or their detection rate simply sucks.

You are quite correct. The problem lies with the resolution. I use OCR software quite a lot. The only time I get 99% accuracy is when using 300 dpi for scanned documents or books. If you have an average all-in-one machine, scanning at 300 dpi is slow and painful. When I scan at 150 dpi due to shortage of time (because of the scan speed), the accuracy for OCR is a miserable 70%...

90% accuracy is horrible in this day and age. I'm sure there's a little company they could've bought for a couple million that's already producing near 100% accuracy. 99% is the standard for computer generated documents.
I like the idea; I have to scan something with OCR once a year which doesn't justify going out and buying the software, and there's not a lot of free solutions out there. But this is substandard quality even if it's free. Google should pull this and bring it up to par with other solutions.

17 mistakes in the second document

I think it would be quicker to actually write it out from start, than go through fixing all the errors.

Hopefully it will get better as more people learn, seems to struggle with "your"