Display and Extract Text from PDFs in React.js Using `react-pdf`

To read text from a PDF file in a React.js application, you can use pdfjs-dist or react-pdf. This guide starts with a simple example using react-pdf, a wrapper around pdfjs-dist that makes it easy to load and display PDF files. To extract the text itself, you will generally need to work with pdfjs-dist directly.

Reading a PDF Using react-pdf

  1. Install Dependencies

    First, you need to install react-pdf and pdfjs-dist.

    npm install react-pdf
    npm install pdfjs-dist
    
  2. Set Up Component

    Create a component that will load and display the PDF file.

    // src/components/PdfReader.js
    import React, { useState } from 'react';
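    // entry.webpack sets up the PDF.js worker for you in react-pdf v5/v6.
    // From v7 on, import from 'react-pdf' and set pdfjs.GlobalWorkerOptions.workerSrc instead.
    import { Document, Page } from 'react-pdf/dist/esm/entry.webpack';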
    
    const PdfReader = () => {
      const [file, setFile] = useState(null);
      const [numPages, setNumPages] = useState(null);
    
      const onDocumentLoadSuccess = ({ numPages }) => {
        setNumPages(numPages);
      };
    
      const handleFileChange = (event) => {
        setFile(event.target.files[0]);
      };
    
      return (
        <div>
          <input type="file" onChange={handleFileChange} />
          {file && (
            <Document file={file} onLoadSuccess={onDocumentLoadSuccess}>
              {Array.from({ length: numPages || 0 }, (_, index) => (
                <Page key={`page_${index + 1}`} pageNumber={index + 1} />
              ))}
            </Document>
          )}
        </div>
      );
    };
    
    export default PdfReader;
    
  3. Use the Component

    Use the PdfReader component in your application.

    // src/App.js
    import React from 'react';
    import PdfReader from './components/PdfReader';
    
    function App() {
      return (
        <div className="App">
          <h1>PDF Reader</h1>
          <PdfReader />
        </div>
      );
    }
    
    export default App;
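
Because numPages stays null until the document has loaded, it can help to build the list of page numbers defensively. The helper below is a minimal sketch of that idea; pageNumbers is a hypothetical name and not part of react-pdf.

```javascript
// Hypothetical helper (not part of react-pdf): returns [1, 2, ..., numPages],
// or an empty array while numPages is still null or otherwise invalid.
function pageNumbers(numPages) {
  const count = Number.isInteger(numPages) && numPages > 0 ? numPages : 0;
  return Array.from({ length: count }, (_, index) => index + 1);
}
```

Inside PdfReader you could then map over pageNumbers(numPages) and render one Page per entry, so no page is rendered before onDocumentLoadSuccess has run.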
    

Extracting Text from a PDF

If you need the text content of a PDF rather than a rendered view, work with pdfjs-dist directly; react-pdf is geared toward displaying pages.

  1. Read PDF using pdfjs-dist

    Using pdfjs-dist, you can access the PDF content and extract text.

    import React, { useState } from 'react';
    import * as pdfjsLib from 'pdfjs-dist';
    
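    // The worker must match the installed pdfjs-dist version.
    // pdfjs-dist 4+ ships the worker as pdf.worker.mjs, so adjust the file name there.
    pdfjsLib.GlobalWorkerOptions.workerSrc = `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjsLib.version}/pdf.worker.js`;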
    
    const PdfTextExtractor = () => {
      const [text, setText] = useState("");
    
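      const extractText = (file) => {
        const fileReader = new FileReader();
        fileReader.onload = async () => {
          try {
            // FileReader yields an ArrayBuffer; pass it to PDF.js as a typed array
            const typedarray = new Uint8Array(fileReader.result);
            const pdf = await pdfjsLib.getDocument({ data: typedarray }).promise;
    
            // Collect the text of each page, then separate pages with a blank line
            const pages = [];
            for (let i = 1; i <= pdf.numPages; i++) {
              const page = await pdf.getPage(i);
              const textContent = await page.getTextContent();
              pages.push(textContent.items.map(item => item.str).join(' '));
            }
            setText(pages.join('\n\n'));
          } catch (error) {
            // Corrupt or password-protected files end up here
            console.error('Failed to extract text from PDF:', error);
            setText('');
          }
        };
        fileReader.readAsArrayBuffer(file);
      };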
      
      const handleFileChange = (event) => {
        const file = event.target.files[0];
        if (file) extractText(file); // files[0] is undefined if the dialog is cancelled
      };
      
      return (
        <div>
          <input type="file" onChange={handleFileChange} />
          <div>
            <h3>Extracted Text:</h3>
            <p>{text}</p>
          </div>
        </div>
      );
    };
    
    export default PdfTextExtractor;
    
  2. Use the Text Extractor Component

    Integrate this into your app similarly to how you set up the PdfReader.
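
Joining every text item with a space discards the line breaks of the original page. As a rough sketch of an alternative, the helper below rebuilds lines from the items returned by getTextContent(). It assumes a pdfjs-dist version whose text items carry a hasEOL flag and that emits the whitespace between words as separate items; treat both as assumptions to check against your installed version. The name itemsToText is hypothetical.

```javascript
// Hypothetical helper: turns the `items` array of a getTextContent() result
// into a string, starting a new line wherever an item is flagged with hasEOL.
function itemsToText(items) {
  let text = '';
  for (const item of items) {
    // Skip entries without a string payload (e.g. marked-content markers)
    if (typeof item.str !== 'string') continue;
    text += item.str;
    if (item.hasEOL) text += '\n';
  }
  return text;
}
```

In PdfTextExtractor, this would take the place of the map and join call on textContent.items. To see the line breaks in the browser, render the result in a pre element or use the CSS white-space: pre-wrap rule.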

With either approach, you should now be able to display a PDF or extract its text in your React.js application. Note that working with PDFs can be resource-intensive, and text extraction has limits: text in complex layouts may come out in an unexpected order, and scanned PDFs that contain only images yield no text without OCR.
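
Since all parsing happens in the browser, one option is to check the selected file before handing it to either component. This is a sketch only: validatePdfFile and the 20 MB default are hypothetical choices, and the function relies only on the standard File properties name, type, and size.

```javascript
// Hypothetical default limit; pick a value that suits your application.
const DEFAULT_MAX_BYTES = 20 * 1024 * 1024;

// Returns an error message for the user, or null if the file looks acceptable.
function validatePdfFile(file, maxBytes = DEFAULT_MAX_BYTES) {
  if (!file) return 'No file selected.';

  // Some browsers leave `type` empty, so fall back to the file extension
  const looksLikePdf =
    file.type === 'application/pdf' || /\.pdf$/i.test(file.name || '');
  if (!looksLikePdf) return 'Please choose a PDF file.';

  if (file.size > maxBytes) {
    const limitMb = Math.round(maxBytes / (1024 * 1024));
    return `File is larger than ${limitMb} MB.`;
  }
  return null;
}
```

You could call this at the top of handleFileChange and only proceed to setFile or extractText when it returns null. The check is a convenience for users, not a security measure, because both the MIME type and the extension can be wrong.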