Can someone help me on this PDF issue?
Can anyone please help me with this? Please read below and leave your suggestions; I would appreciate it. Some people have given suggestions already, but none has worked yet… thanks anyway! You can write in the comments section below this post. If you need to ask more questions or know more, you can email Paris or Jean Ai… thank you…
All this is for my compilation of various prayers, sadhanas, and meditations I need to do. I want to make one master text. I have over 80 sadhanas to compile together into one master copy…
Tsem Rinpoche
Paris: jpfkhoo@yahoo.com and Jean Ai: elenakhong@gmail.com
Dear all,
This is a follow-up about converting PDF files. May I please clarify a few things, which will make the “search” for programmes more streamlined, and much clearer:
1. THE SUGGESTIONS SO FAR:
Jean Ai and team are already aware of programs that convert PDF to text files. HOWEVER, this works ONLY if the original file was a text file or contained text fields.
For example: It was originally a Microsoft Word file, and was then converted into a PDF. Or it was a Photoshop file, with a text field within the document.
2. OUR SITUATION:
Our situation is a bit different, as follows: the files we are working with were NOT originally text files. They were originally JPGs / images.
For example: for Rinpoche’s sadhana book, we do not have the original soft-copy text files. We scanned each page in as an image file. Therefore, the entire page – including the text – is recognised as an image.
Therefore:
The current suggestions that people have been giving will not work. They have been giving suggestions of programs that can only convert PDF files if they were already originally text.
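For volunteers who want to check which kind of PDF they have before trying a converter, here is a rough Python sketch. It is only a heuristic, not a definitive test: it assumes that a PDF with real text declares fonts (a `/Font` entry), while a scanned page usually carries only image objects (`/Subtype /Image`). Newer PDFs can store these entries inside compressed streams, in which case the sketch simply answers "unknown". The function name `classify_pdf` is made up for this illustration.

```python
import re


def classify_pdf(data: bytes) -> str:
    """Guess whether a PDF holds real text or only scanned page images.

    Heuristic only: pages with real text reference fonts (/Font), while
    scanned pages usually contain just image objects (/Subtype /Image).
    PDFs that keep their objects in compressed streams can hide both
    markers, so "unknown" is a possible answer.
    """
    has_font = re.search(rb"/Font\b", data) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None
    if has_font:
        return "text"      # ordinary PDF-to-text converters may work
    if has_image:
        return "scanned"   # only OCR or retyping can recover the words
    return "unknown"


if __name__ == "__main__":
    # Tiny hand-made byte strings standing in for real files.
    word_export = b"%PDF-1.4 << /Type /Page /Resources << /Font << /F1 2 0 R >> >> >>"
    scanned_page = b"%PDF-1.4 << /Type /XObject /Subtype /Image /Width 10 >>"
    print(classify_pdf(word_export))
    print(classify_pdf(scanned_page))
```

To check a real file, read it in binary mode and pass the bytes in, for example `classify_pdf(open("page.pdf", "rb").read())`, where `page.pdf` is a placeholder name.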
3. WHAT WE NEED:
We will need some kind of magic program / software that will be able to extract text out from PDF files which were originally images, not text. This will probably mean that the software will need some kind of in-built character recognition function that can “read” the text and handwriting even if the whole file is actually a picture/image/jpg.
At the same time, this program should be able to extract out images within the image – this is important for things like symbols (e.g. bells, vajras) or hand-drawn images, or Tibetan script.
Essentially: the program will need to be intelligent enough to decide what is actually text and what is an image, and then extract / convert both text and images as separate editable components.
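The "decide what is text and what is an image" step could be sketched roughly as below, in Python. This is not how any particular OCR product works. It assumes the OCR engine reports a confidence score between 0.0 and 1.0 for each area of the page; the `Region` class, the `split_regions` function, the 0.8 threshold, and the sample page are all invented for illustration. Areas the engine reads confidently are kept as editable text; everything else (symbols, drawings, script it cannot read) is kept as a picture rather than turned into random symbols.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str         # hypothetical description of an area on the scanned page
    ocr_text: str      # whatever the OCR engine returned for that area
    confidence: float  # engine's confidence, assumed to run from 0.0 to 1.0


def split_regions(regions, threshold=0.8):
    """Return (text_regions, image_regions) based on OCR confidence.

    A region counts as text only if the engine is confident AND it
    actually produced some characters; everything else stays an image.
    """
    text, images = [], []
    for region in regions:
        if region.confidence >= threshold and region.ocr_text.strip():
            text.append(region)
        else:
            images.append(region)
    return text, images


if __name__ == "__main__":
    # Made-up sample page; the scores are illustrative, not measured.
    page = [
        Region("printed English verse", "May all beings be happy", 0.95),
        Region("Tibetan script line", "@#%&", 0.20),
        Region("handwritten note", "", 0.10),
        Region("vajra drawing", "", 0.0),
    ]
    text, images = split_regions(page)
    print("keep as text: ", [r.label for r in text])
    print("keep as image:", [r.label for r in images])
```

The threshold is a judgment call: set it higher and more areas stay as pictures that need retyping by hand, set it lower and more garbled text slips through for proofreading.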
An illustrated example of what we are looking for is attached for your reference.
Please do explain this to friends and volunteers if they are offering suggestions, so that the search is more streamlined and we don’t spend time looking into programs that aren’t suitable for our purposes.
Thank you,
Paris
Hi Rinpoche 🙂
I think the solution for that is so-called OCR software, such as:
– ABBYY FineReader Professional 11.0.102.481
– FreeOCR.net 3.0
– OmniPage Professional 18.0
– Presto! OCR Pro 4.0.2.40
They are not perfect but will definitely help.
Best Regards
Gosia
I think the best software is to do it manually. Type all the words, scan all the pictures and paste them in.
It is a lot of work, but if we do it as a group then nothing is impossible.
I’m in if you need me. I have never met Rinpoche or attended any of his talks, but I like his vision.
If all the sophisticated software fails then I would suggest the very down to earth method that is photocopy, cut, paste, scan and you get the copy you want.
There are many OCR (Optical Character Recognition) software programs.
I used Recognita on JPEG images with Romanian special characters. There is another one, supposed to be better, called ABBYY.
Very interesting situation indeed… I’ve never actually heard of converting images with text into texts, but apparently it’s possible.
Why not start the sadhana book from scratch? I know it may seem like a daunting task, but at least you can get it the way you want it to be.
Let’s convert them manually then 🙂
It is the best and most accurate way compared to any automated way.
The automated way isn’t that good, since you still need to proofread the result afterwards. So, do it manually and get it right the first time.
Best software I know for this is called Able2Extract (http://www.investintech.com/able2extract.html). Will convert PDF to .txt, .doc, .xls or .html. The software uses very clever OCR and keeps images as images but converts any text (typed or handwritten) to text. Seems to be exactly what you need, but you can check features here – http://www.investintech.com/products/desktop/a2e/features/ .
Powerful settings for flexible use, but user-friendly at the same time – for example, you can select which areas to ‘OCR’ and which to ignore. The OCR is much more accurate than others I’ve used – especially with handwriting.
Hope this helps,
Stuart.
Hi Stuart, thanks so much for your suggestion! I’ve just run the programme on five different samples of Rinpoche’s sadhana book and unfortunately, it doesn’t pick up the Tibetan script or Rinpoche’s handwritten notes… it turns them into a bunch of random symbols 🙁 I can imagine, though, that the programme would be quite useful in other circumstances; it’s got much better character recognition than many other OCR programmes I’ve tested! Thanks again for the suggestion… it was definitely much closer than anything we’ve previously used 🙂
Hi there! A PDF file can be copied and pasted into an AutoCAD file, and you can then use AutoCAD to trace it out. Or else get a free trial program such as ‘pdf file convert to word or autocad drw’.
I’m sorry, I don’t really get what you want, but just now I tried this with your example: I right-clicked, copied the image and pasted it into Word or Excel, then traced it out there. It can be done, it just needs a bit of time. If you want the special handwriting to be an image, then use Paintbrush to create it, save it, and insert it into Word or Excel as a picture. Either a PDF file or a JPEG file can be copied and pasted into Word or Excel, or you can use Print Screen to paste it into Paintbrush and then cut out the part you want.
I do not know how many pages one sadhana is or how big the files are, but if you think my suggestion works, then work together in a group to type or trace out all the sadhanas.
Hi there, I’m a student of Software Engineering… The technology to convert images of text to editable text is OCR, as some people have commented, but it’s not very accurate, and it’s mostly meant for computer prints or machines that use those kinds of fonts. As far as I know, there aren’t any handwriting OCR programs that are trustworthy. I found some on the Internet (I have never used them), but the comments on this software are not good: they say it only works acceptably with handwriting that is very similar to computer fonts, and in the example posted I don’t think there is any similarity. You could try, but I really don’t think it will work 🙁
I use software called ABBYY OCR. It can capture handwriting, but it may not be accurate.
You can use Google Docs: upload all the images into Google Docs and you can use it to read the text and separate the pictures.
The Tutorial is here.
http://googledocs.blogspot.com/2010/06/optical-character-recognition-ocr-in.html#!/2010/06/optical-character-recognition-ocr-in.html
Rinpoche, I have a program named “Preview” on my new iMac.
You can “Open” a .jpg and then “Export” the .jpg choosing the PDF format.
If you want to e-mail me a test .jpg, I will open it in Preview and Export it as a PDF and send it back to you.
John Kernell
Hi John, thank you for your suggestion but Rinpoche does not like PDF.
Hi Jean Ai,
I am not good at IT, but I am thinking that since “Preview” can do JPG->PDF, can we do it this way round and then convert the PDF files to text (as you have said that there is no problem converting docs from/to PDF)?
This sounds like a double job, but if in the worst-case scenario no other good/accurate software is available, we may have to do this, yes?
I am exploring other possible options and will email you once I have any good findings.
Hi Springflower, as the blog post above notes, you can only convert back into text if the original was in text form.
Since our original is NOT in text form (but in JPG form), you cannot convert PDF files into text using what we have. Thanks for your suggestion though 🙂
Paris – Unfortunately, it may not be possible to do this. OCR programs can work, but they are tricky and are far from perfect. Most things converted this way will still (usually) need massive amounts of manual editing. Plus, OCR programs are very bad at converting handwritten text to type.
The first link offers a free service, I haven’t used it before but you can try it out http://www.free-ocr.com/
Dear Paris, it may sound silly, but have you looked into “image to text” on Google? The technology is called OCR (Optical Character Recognition) and many people use it to upload torrents online. There are many programs in this category which people use to upload informational torrents with