Browsing Erkki’s diary with the help of Paul Hough

For a large part of his life, Kurenniemi kept a diary in different forms: handwritten in notebooks, typed on his computer, recorded on audio cassettes and on video. I recently opened a notebook exported as a PDF from Mathematica, the software conceived by Stephen Wolfram. In it we can read Erkki’s stream of thoughts about maths, drawing, women, technology, sex … The notebook also contains hand drawings saved as vectors. In one entry, Erkki mentions freehand, but we don’t know whether he refers to the program or to a freehand technique.

PDF is a very comfortable format for reading a text. Interpreting it, indexing its content or performing manipulations on it is another story.
Several tools, such as pdftotext, make it possible to extract the textual content of a PDF, for instance to make it searchable.

Out of curiosity, I looked for a way to detect which pages of the PDF contain inserted images. Using the command pdftoimages, I extracted all the pages of the PDF and saved them as bitmaps. I then decided to read the PDF in the company of OpenCV. Within OpenCV, I used a function, cvHoughLines2, that detects the straight lines present in an image. The technique behind this function is named after Paul Hough, who patented it in 1962; it was later refined, improved and transformed several times before taking the form we know today in computer vision.

Line detection

“In automated analysis of digital images, a subproblem often arises of detecting simple shapes, such as straight lines, circles or ellipses. In many cases an edge detector can be used as a pre-processing stage to obtain image points or image pixels that are on the desired curve in the image space. Due to imperfections in either the image data or the edge detector, however, there may be missing points or pixels on the desired curves as well as spatial deviations between the ideal line/circle/ellipse and the noisy edge points as they are obtained from the edge detector. For these reasons, it is often non-trivial to group the extracted edge features to an appropriate set of lines, circles or ellipses. The purpose of the Hough transform is to address this problem by making it possible to perform groupings of edge points into object candidates by performing an explicit voting procedure over a set of parameterized image objects (Shapiro and Stockman, 304).”
http://en.wikipedia.org/wiki/Hough_transform
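
To make the voting procedure described in the quote more concrete, here is a minimal sketch in plain C++, without OpenCV. It is not the implementation behind cvHoughLines2, only an illustration of the idea: every edge point votes for all the lines that could pass through it, each line being described by an angle theta and a distance rho with rho = x*cos(theta) + y*sin(theta). The names EdgePoint, PolarLine and strongestLine are hypothetical, and the resolution of one pixel and one degree is an assumption chosen to mirror the 1 and CV_PI/180 arguments in the listing further down.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch (they are not part of OpenCV).
struct EdgePoint { int x; int y; };
struct PolarLine { double rho; double theta; int votes; };

// Every edge point votes for each (rho, theta) cell of a line that could pass
// through it, with rho = x*cos(theta) + y*sin(theta). The cell that collects
// the most votes is returned as the strongest line candidate.
PolarLine strongestLine(const std::vector<EdgePoint>& points, int width, int height)
{
    const double kPi = 3.14159265358979323846;
    const int thetaSteps = 180;  // assumed resolution: one degree per step
    const int maxRho = static_cast<int>(
        std::ceil(std::sqrt(double(width) * width + double(height) * height)));
    const int rhoSteps = 2 * maxRho + 1;  // rho ranges over [-maxRho, maxRho]
    std::vector<int> acc(static_cast<std::size_t>(thetaSteps) * rhoSteps, 0);

    for (const EdgePoint& p : points) {
        for (int t = 0; t < thetaSteps; ++t) {
            const double theta = t * kPi / thetaSteps;
            const int rho = static_cast<int>(
                std::lround(p.x * std::cos(theta) + p.y * std::sin(theta)));
            if (rho < -maxRho || rho > maxRho) continue;  // point outside the image
            ++acc[static_cast<std::size_t>(t) * rhoSteps + (rho + maxRho)];
        }
    }

    PolarLine best = {0.0, 0.0, 0};
    for (int t = 0; t < thetaSteps; ++t) {
        for (int r = 0; r < rhoSteps; ++r) {
            const int votes = acc[static_cast<std::size_t>(t) * rhoSteps + r];
            if (votes > best.votes) {
                best.rho = r - maxRho;
                best.theta = t * kPi / thetaSteps;
                best.votes = votes;
            }
        }
    }
    return best;
}
```

Fed with the points of a vertical stroke, the function should answer with theta close to 0; with a horizontal line of text, theta close to 90 degrees. A real detector keeps every cell above a vote threshold rather than only the strongest one.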

The presence of non-horizontal lines gives me a hint for selecting the pages with images. In the last pictures, the pages that contain only horizontal lines are discarded (struck through).
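
The selection rule can be sketched on its own, independently of OpenCV. The types Point and Segment and the functions countNonHorizontal and looksLikeImagePage are hypothetical names introduced for this illustration. The values 50 and 3 echo the diffy test and the linesthreshold variable of the listing below, but how the listing compares the count with the threshold is not visible in the code, so the comparison used here is an assumption.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical types for this sketch; the listing uses pairs of CvPoint instead.
struct Point { int x; int y; };
struct Segment { Point a; Point b; };

// Count the segments whose endpoints differ vertically by more than minRise pixels.
// Lines of text give nearly horizontal segments, so their vertical difference is small.
int countNonHorizontal(const std::vector<Segment>& segments, int minRise)
{
    int count = 0;
    for (const Segment& s : segments) {
        if (std::abs(s.a.y - s.b.y) > minRise) {
            ++count;
        }
    }
    return count;
}

// Assumption: a page is kept as a likely image page once at least minSegments
// non-horizontal segments have been found on it.
bool looksLikeImagePage(const std::vector<Segment>& segments, int minRise, int minSegments)
{
    return countNonHorizontal(segments, minRise) >= minSegments;
}
```

A call such as looksLikeImagePage(segments, 50, 3) would then keep a page with three or more slanted or vertical strokes and discard a page that contains only lines of text.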

line detection and selection

Code

/* This is a standalone program. Pass an image name as a first parameter
of the program.  Switch between standard and probabilistic Hough transform
by changing "#if 1" to "#if 0" and back */
#include <cv.h>
#include <highgui.h>
#include <math.h>
#include <iostream>

using namespace std;

int main(int argc, char** argv)
{
    int linesdetected, linesthreshold;
    linesthreshold=3;
    linesdetected=0;
    IplImage* src;
    if( argc >1 && (src=cvLoadImage(argv[1], 0))!= 0)
    {
        IplImage* dst = cvCreateImage( cvGetSize(src), 8, 1 );
        IplImage* color_dst = cvCreateImage( cvGetSize(src), 8, 3 );
        CvMemStorage* storage = cvCreateMemStorage(0);
        CvSeq* lines = 0;
        int i;
        cvCanny( src, dst, 50, 200, 3 );
        cvCvtColor( dst, color_dst, CV_GRAY2BGR );
#if 0
        lines = cvHoughLines2( dst,
                               storage,
                               CV_HOUGH_STANDARD,
                               1,
                               CV_PI/180,
                               100,
                               0,
                               0 );

        for( i = 0; i < MIN(lines->total,100); i++ )
        {
            float* line = (float*)cvGetSeqElem(lines,i);
            float rho = line[0];
            float theta = line[1];
            CvPoint pt1, pt2;
            double a = cos(theta), b = sin(theta);
            double x0 = a*rho, y0 = b*rho;
            pt1.x = cvRound(x0 + 1000*(-b));
            pt1.y = cvRound(y0 + 1000*(a));
            pt2.x = cvRound(x0 - 1000*(-b));
            pt2.y = cvRound(y0 - 1000*(a));
            cvLine( color_dst, pt1, pt2, CV_RGB(255,0,0), 3, 8 );
        }
#else
        lines = cvHoughLines2( dst,
                               storage,
                               CV_HOUGH_PROBABILISTIC,
                               1,
                               CV_PI/180,
                               80,
                               30,
                               10 );
        for( i = 0; i < lines->total; i++ )
        {
            CvPoint* line = (CvPoint*)cvGetSeqElem(lines,i);
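            /* vertical extent of the segment: close to 0 for a horizontal line of text */
            int diffy;
            if(line[0].y>line[1].y){
                diffy=line[0].y-line[1].y;
            }else{
                diffy=line[1].y-line[0].y;
            }
            /* draw and count only the segments that rise or fall by more than 50 pixels */
            if(diffy>50){
                cvLine( color_dst, line[0], line[1], CV_RGB(255,255,0), 1, CV_AA);
                linesdetected++;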
                //cout << "linesdetected: " << linesdetected << endl;
        /* [...] */
        cross2.x=color_dst->width;
        cross2.y=color_dst->height;
        cross3.x=color_dst->width;
        cross3.y=0;
        cross4.x=0;
        cross4.y=color_dst->height;
//        cout << "cross1 x: " << cross1.x << " cross1 y: " << cross1.y << endl;