ENH: Allow text extraction to keep indentation #2054

MartinThoma · 2023-08-01T05:27:22Z

When we extract Python code from a PDF, it's completely messed up. It would be nice to have an option that keeps the indentation. Maybe a flag for a layout-mode?

Code Example: How the new feature could be used

from pypdf import PdfReader

# https://arxiv.org/pdf/1601.03642.pdf
reader = PdfReader("1601.03642.pdf")
print(reader.pages[6].extract_text(layout_mode=True))

should give:

 * Increment the size file of the new incorrect UI_FILTER group information
 * of the size generatively.
 */
static int indicate_policy(void)
{
    int error;
    if (fd == MARN_EPT) {
        /*
         * The kernel blank will coeld it to userspace.
         */
        if (ss->segment < mem_total)
            unblock_graph_and_set_blocked();
        else
            ret = 1;
        goto bail;
    }
    segaddr = in_SB(in.addr);
    selector = seg / 16;
    setup_works = true;
    for (i = 0; i < blocks; i++) {
        seq = buf[i++];
        bpf = bd->bd.next + i * search;
        if (fd) {
            current = blocked;
        }
    }
    rw->name = "Getjbbregs";
    bprm_self_clearl(&iv->version);
    regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
    return segtable;
}


D. Linux Code, 2

/*
* Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 as published by
* the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
*
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/

#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>

Currently, we get:

*Increment the size file of the new incorrect UI_FILTER group information
*of the size generatively.
*/
static int indicate_policy(void)
{
int error;
if (fd == MARN_EPT) {
/*
*The kernel blank will coeld it to userspace.
*/
if (ss->segment < mem_total)
unblock_graph_and_set_blocked();
else
ret = 1;
goto bail;
}
segaddr = in_SB(in.addr);
selector = seg / 16;
setup_works = true;
for (i = 0; i < blocks; i++) {
seq = buf[i++];
bpf = bd->bd.next + i *search;
if (fd) {
current = blocked;
}
}
rw->name = "Getjbbregs";
bprm_self_clearl(&iv->version);
regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
return segtable;
}
D. Linux Code, 2
/*
*Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify it
*under the terms of the GNU General Public License version 2 as published by
*the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
*but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
*
*GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
*Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>

MrAnayDongre · 2023-08-03T06:40:25Z

I'd be interested in contributing to this enhancement for PyPDF2 @MartinThoma.
Let me know how I can be of assistance

MartinThoma · 2023-08-03T15:21:19Z

@MrAnayDongre PyPDF2 is deprecated. This is going into pypdf.

This is a very complex feature. I don't know myself yet what would be a good way to start on it.

If you want to start contributing to pypdf, I recommend having a look at issues labeled "Easy" (good starting points for first-time contributors), then at those labeled "help wanted" (we appreciate help everywhere, and one of these might be an easy start).

pubpub-zz · 2024-04-08T21:25:33Z

extract_text now has a layout extraction_mode.
I am closing this old issue, since it is now covered.

stefan6419846 · 2024-04-09T05:38:36Z

@pubpub-zz The layout mode does not resolve this and this issue requires further work to convert horizontal positions into whitespace accordingly.
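To illustrate what "converting horizontal positions into whitespace" could mean, here is a minimal sketch. It is not pypdf's implementation: the `render_line` helper, the `(x, text)` fragment tuples, and the fixed `char_width` are all assumptions made for illustration, and a monospaced font is assumed.

```python
from typing import List, Tuple

# Hypothetical fragment type: (x position in PDF units, text). Not a pypdf type.
Fragment = Tuple[float, str]


def render_line(fragments: List[Fragment], left_margin: float, char_width: float) -> str:
    """Place each fragment at the text column implied by its x position."""
    line = ""
    for x, text in sorted(fragments, key=lambda frag: frag[0]):
        # Map the horizontal offset from the margin to a character column.
        column = round((x - left_margin) / char_width)
        # Pad with spaces up to that column; never overwrite earlier text.
        if column > len(line):
            line += " " * (column - len(line))
        line += text
    return line


# Example with assumed numbers: margin at x=48, each character 6 units wide.
print(render_line([(72.0, "int error;")], left_margin=48.0, char_width=6.0))
```

With proportional fonts a single `char_width` does not exist, which is one reason the real problem is harder than this sketch suggests.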

I have therefore re-opened this issue.

pubpub-zz · 2024-04-09T17:48:11Z

@stefan6419846, this is the rendering:
print(rr.pages[6].extract_text(extraction_mode="layout"))
->

 * Increment  the  size  file  of  the  new  incorrect  UI_FILTER  group  information
 * of  the  size  generatively.
 */
static  int  indicate_policy(void)
{
   int  error;
   if  (fd  ==  MARN_EPT)  {
     /*
       * The  kernel  blank  will  coeld  it  to  userspace.
       */
     if  (ss->segment  <  mem_total)
        unblock_graph_and_set_blocked();
     else
        ret  =  1;
     goto  bail;
   }
   segaddr  =  in_SB(in.addr);
   selector  =  seg  /  16;
   setup_works  =  true;
   for  (i  =  0;  i  <  blocks;  i++)  {
     seq  =  buf[i++];
     bpf  =  bd->bd.next  +  i  * search;
     if  (fd)  {
        current  =  blocked;
     }
   }
   rw->name  =  "Getjbbregs";
   bprm_self_clearl(&iv->version);
   regs->new  =  blocks[(BPF_STATS  <<  info->historidac)]  |  PFMR_CLOBATHINC_SECONDS  <<  12;
   return  segtable;
}


D. Linux Code, 2

/*
 *   Copyright  (c)  2006-2010,  Intel  Mobile  Communications.   All  rights  reserved.
 *
 *     This  program  is  free  software;  you  can  redistribute  it  and/or  modify  it
 * under  the  terms  of  the  GNU  General  Public  License  version  2  as  published  by
 * the  Free  Software  Foundation.
 *
 *               This  program  is  distributed  in  the  hope  that  it  will  be  useful,
 * but  WITHOUT  ANY  WARRANTY;  without  even  the  implied  warranty  of
 *     MERCHANTABILITY  or  FITNESS  FOR  A  PARTICULAR  PURPOSE.   See  the
 *
 *   GNU  General  Public  License  for  more  details.
 *
 *     You  should  have  received  a  copy  of  the  GNU  General  Public  License
 *       along  with  this  program;  if  not,  write  to  the  Free  Software  Foundation,
 *   Inc.,  675  Mass  Ave,  Cambridge,  MA  02139,  USA.
 */

#include  <linux/kexec.h>
#include  <linux/errno.h>
#include  <linux/io.h>
#include  <linux/platform_device.h>
#include  <linux/multi.h>

Isn't this good?

stefan6419846 · 2024-04-09T17:55:48Z

Sorry, seems like my checkout was somehow broken. Still not optimal, but yes, then we can close this for now.
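One visible difference from the desired output is the doubled spaces between words in the layout rendering. A possible workaround on the caller's side, sketched here as an assumption rather than anything pypdf provides, is to collapse runs of spaces inside each line while leaving the leading indentation alone. The helper name `collapse_inner_spaces` is hypothetical.

```python
import re


def collapse_inner_spaces(text: str) -> str:
    """Collapse runs of spaces inside each line, keeping leading indentation."""
    cleaned = []
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        indent = line[: len(line) - len(stripped)]
        # Only the part after the indentation is collapsed.
        cleaned.append(indent + re.sub(r" {2,}", " ", stripped).rstrip())
    return "\n".join(cleaned)


sample = "static  int  indicate_policy(void)\n{\n   int  error;"
print(collapse_inner_spaces(sample))
```

This also collapses spaces inside string literals and aligned comments, and it does not normalize the uneven indentation widths, so it is only a rough cleanup.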

MartinThoma self-assigned this Aug 1, 2023

MartinThoma added the is-feature (A feature request) and workflow-text-extraction (From a user's perspective, text extraction is the affected feature/workflow) labels Aug 1, 2023

stefan6419846 unassigned MartinThoma Feb 20, 2024

pubpub-zz closed this as completed Apr 8, 2024

stefan6419846 reopened this Apr 9, 2024

stefan6419846 closed this as completed Apr 9, 2024
