Deduplicate CSV data in JavaScript before database insert

January 11, 2026

Duplicate rows in CSV files cause real problems: inflated analytics, failed unique constraints, wasted storage, and corrupted data pipelines. Before inserting CSV data into your database, you need to deduplicate it.

This guide covers the main approaches to deduplicating CSV data in JavaScript, from basic exact matching to streaming large files that would crash a naive implementation. Each example is complete and runnable.

Prerequisites

  • Node.js 18+
  • npm or yarn
  • Basic familiarity with JavaScript/TypeScript

What you'll learn

By the end of this tutorial, you'll know how to:

  • Remove exact duplicate rows using Set and Map
  • Deduplicate based on specific columns (email, ID, composite keys)
  • Choose between keeping the first or last occurrence
  • Handle large CSV files with streaming
  • Avoid common pitfalls like case sensitivity and whitespace

Step 1: Project setup

Install PapaParse for CSV parsing:

npm install papaparse
npm install --save-dev @types/papaparse

For Node.js streaming with large files, also install:

npm install csv-parser csv-writer

Step 2: Basic exact duplicate removal

The fastest way to remove duplicates from an array in JavaScript is using a Set. For primitive values (strings, numbers), this is straightforward:

const values = ['apple', 'banana', 'apple', 'cherry', 'banana'];
const unique = [...new Set(values)];
// ['apple', 'banana', 'cherry']

For CSV rows (which are objects), Set doesn't work directly because it compares object references, not values. You need to create a string key from each row:

import Papa from 'papaparse';

interface CsvRow {
  name: string;
  email: string;
  phone: string;
}

function deduplicateExact(rows: CsvRow[]): CsvRow[] {
  const seen = new Set<string>();
  return rows.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Usage with PapaParse
Papa.parse<CsvRow>(csvFile, {
  header: true,
  complete: (results) => {
    const uniqueRows = deduplicateExact(results.data);
    console.log(`Removed ${results.data.length - uniqueRows.length} duplicates`);
  }
});

This approach runs in O(n) time, which makes it roughly 440-800x faster than filter + findIndex on datasets with 500+ rows.
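For reference, here is the quadratic filter + findIndex pattern that comparison refers to. It re-scans the array for every row, which is why it degrades so quickly as the row count grows; keep it only as a baseline to measure against:

// O(n^2) baseline: findIndex re-scans the array for every row
function deduplicateQuadratic(rows: CsvRow[]): CsvRow[] {
  return rows.filter((row, index) =>
    rows.findIndex(other => JSON.stringify(other) === JSON.stringify(row)) === index
  );
}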

Step 3: Column-based deduplication

Often you need to deduplicate based on a specific column rather than the entire row. For example, keeping only one row per email address:

interface UserRow {
  id: string;
  name: string;
  email: string;
  created_at: string;
}

function deduplicateByKey<T extends Record<string, unknown>>(
  rows: T[],
  key: keyof T
): T[] {
  const seen = new Set<string>();
  return rows.filter(row => {
    const value = String(row[key] ?? '').toLowerCase().trim();
    if (!value || seen.has(value)) return false;
    seen.add(value);
    return true;
  });
}

// Keep first occurrence of each email
const uniqueByEmail = deduplicateByKey(users, 'email');

Notice the normalization: toLowerCase().trim(). This handles common cases where John@Example.com and john@example.com should be treated as duplicates.
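As a quick check, here is a hypothetical pair of rows that differ only in casing and trailing whitespace; deduplicateByKey collapses them into one:

const sampleUsers: UserRow[] = [
  { id: '1', name: 'John Doe', email: 'John@Example.com ', created_at: '2024-01-01' },
  { id: '2', name: 'John Doe', email: 'john@example.com', created_at: '2024-02-01' },
];

// Both emails normalize to 'john@example.com', so only the first row survives
console.log(deduplicateByKey(sampleUsers, 'email').length); // 1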

Step 4: Multi-column composite key deduplication

Sometimes duplicates are defined by a combination of columns. For instance, a contact might be duplicate if both email AND phone match:

function deduplicateByKeys<T extends Record<string, unknown>>(
  rows: T[],
  keys: (keyof T)[]
): T[] {
  const seen = new Set<string>();
  return rows.filter(row => {
    const compositeKey = keys
      .map(k => String(row[k] ?? '').toLowerCase().trim())
      .join('|');
    if (seen.has(compositeKey)) return false;
    seen.add(compositeKey);
    return true;
  });
}

// Deduplicate where both email AND phone match
const unique = deduplicateByKeys(contacts, ['email', 'phone']);

The pipe character | separates key parts. Choose a delimiter that won't appear in your data.
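If you can't guarantee a safe delimiter, one option is to serialize the normalized key parts with JSON.stringify instead, since it escapes the values rather than relying on a separator. A small variation on the helper above (the function name is just for illustration):

function compositeKeyOf<T extends Record<string, unknown>>(
  row: T,
  keys: (keyof T)[]
): string {
  // JSON.stringify escapes quotes and delimiters inside values,
  // so 'a|b' + 'c' can never collide with 'a' + 'b|c'
  return JSON.stringify(
    keys.map(k => String(row[k] ?? '').toLowerCase().trim())
  );
}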

Step 5: Keep first vs keep last occurrence

By default, filter keeps the first occurrence. To keep the last occurrence instead (useful when you want the most recent record), use a Map:

function deduplicateKeepLast<T extends Record<string, unknown>>(
  rows: T[],
  key: keyof T
): T[] {
  const map = new Map<string, T>();
  rows.forEach(row => {
    const value = String(row[key] ?? '').toLowerCase().trim();
    if (value) {
      map.set(value, row); // Overwrites previous, keeping last
    }
  });
  return [...map.values()];
}

// Keep the most recent record for each email
const latestRecords = deduplicateKeepLast(users, 'email');

With a Map, each subsequent duplicate overwrites the previous one, so you end up with the last occurrence.

Step 6: Streaming for large CSV files

Loading a multi-gigabyte CSV file into memory causes "JavaScript heap out of memory" errors. For large files, process data as a stream:

import * as fs from 'fs';
import csv from 'csv-parser';
import { createObjectCsvWriter } from 'csv-writer';

interface Row {
  [key: string]: string;
}

async function deduplicateLargeFile(
  inputPath: string,
  outputPath: string,
  keyColumn: string
): Promise<number> {
  const seen = new Set<string>();
  const uniqueRows: Row[] = [];
  let headers: string[] = [];

  return new Promise((resolve, reject) => {
    fs.createReadStream(inputPath)
      .pipe(csv())
      .on('headers', (h: string[]) => {
        headers = h;
      })
      .on('data', (row: Row) => {
        const key = row[keyColumn]?.toLowerCase().trim();
        if (key && !seen.has(key)) {
          seen.add(key);
          uniqueRows.push(row);
        }
      })
      .on('end', async () => {
        if (uniqueRows.length === 0) {
          resolve(0);
          return;
        }

        const writer = createObjectCsvWriter({
          path: outputPath,
          header: headers.map(id => ({ id, title: id }))
        });
        await writer.writeRecords(uniqueRows);
        resolve(uniqueRows.length);
      })
      .on('error', reject);
  });
}

// Usage
const count = await deduplicateLargeFile(
  'large-contacts.csv',
  'deduplicated-contacts.csv',
  'email'
);
console.log(`Wrote ${count} unique rows`);

This approach avoids parsing the whole file into memory at once; only the keys (as strings in a Set) and the unique rows are kept, and the unique rows are buffered in the uniqueRows array until the end so csv-writer can write them in one pass.
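If even the unique rows are too many to buffer, you can write each row as soon as it passes the Set check. Here is a minimal sketch using a plain write stream; it does naive CSV quoting, assumes every row shares the header's columns, and ignores backpressure for brevity:

import * as fs from 'fs';
import csv from 'csv-parser';

function deduplicateStreamed(
  inputPath: string,
  outputPath: string,
  keyColumn: string
): Promise<void> {
  const seen = new Set<string>();
  const out = fs.createWriteStream(outputPath);
  const quote = (v: string) => `"${String(v ?? '').replace(/"/g, '""')}"`;
  let headerWritten = false;

  return new Promise((resolve, reject) => {
    fs.createReadStream(inputPath)
      .pipe(csv())
      .on('data', (row: Record<string, string>) => {
        if (!headerWritten) {
          out.write(Object.keys(row).map(quote).join(',') + '\n');
          headerWritten = true;
        }
        const key = row[keyColumn]?.toLowerCase().trim();
        if (key && !seen.has(key)) {
          seen.add(key);
          // Write immediately instead of collecting rows in an array
          out.write(Object.values(row).map(quote).join(',') + '\n');
        }
      })
      .on('end', () => {
        out.end();
        resolve();
      })
      .on('error', reject);
  });
}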

Step 7: SHA-256 hashing for many columns

When rows have many columns and you need exact-match deduplication, JSON.stringify can become expensive. For better performance with wide tables, use a hash:

import { createHash } from 'crypto';

function deduplicateWithHash<T>(rows: T[]): T[] {
  const seen = new Set<string>();
  return rows.filter(row => {
    const hash = createHash('sha256')
      .update(JSON.stringify(row))
      .digest('hex');
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}

The fixed-length hash uses less memory than storing full JSON strings for each unique row.
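One caveat: JSON.stringify is sensitive to property order, so two rows with the same values but different key order hash differently. Rows parsed from a single CSV header share the same order, but if your data comes from mixed sources, sort the keys before hashing. A sketch of that helper (not part of the example above):

function stableStringify(row: Record<string, unknown>): string {
  // Sort keys so property order never affects the hash
  return JSON.stringify(
    Object.keys(row).sort().map(k => [k, row[k]])
  );
}

// Then hash stableStringify(row) instead of JSON.stringify(row)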

Complete example

Here's a complete, runnable example that combines parsing and deduplication:

import Papa from 'papaparse';

interface ContactRow {
  id: string;
  name: string;
  email: string;
  phone: string;
  company: string;
}

type DedupeStrategy = 'first' | 'last';

interface DedupeOptions {
  keys: (keyof ContactRow)[];
  strategy: DedupeStrategy;
  normalize: boolean;
}

function deduplicateCsv(
  rows: ContactRow[],
  options: DedupeOptions
): ContactRow[] {
  const { keys, strategy, normalize } = options;

  const getKey = (row: ContactRow): string => {
    return keys
      .map(k => {
        let value = String(row[k] ?? '');
        if (normalize) {
          value = value.toLowerCase().trim();
        }
        return value;
      })
      .join('|');
  };

  if (strategy === 'last') {
    const map = new Map<string, ContactRow>();
    rows.forEach(row => {
      const key = getKey(row);
      if (key) map.set(key, row);
    });
    return [...map.values()];
  }

  // strategy === 'first'
  const seen = new Set<string>();
  return rows.filter(row => {
    const key = getKey(row);
    if (!key || seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Example usage
const csvData = `id,name,email,phone,company
1,John Doe,john@example.com,555-1234,Acme Inc
2,Jane Smith,jane@example.com,555-5678,Beta Corp
3,John Doe,JOHN@EXAMPLE.COM,555-9999,Acme Inc
4,Bob Wilson,bob@example.com,555-1111,Gamma LLC
5,Jane Smith,jane@example.com,555-5678,Beta Corp`;

Papa.parse<ContactRow>(csvData, {
  header: true,
  skipEmptyLines: true,
  complete: (results) => {
    const unique = deduplicateCsv(results.data, {
      keys: ['email'],
      strategy: 'first',
      normalize: true
    });

    console.log('Original rows:', results.data.length);
    console.log('Unique rows:', unique.length);
    console.log('Duplicates removed:', results.data.length - unique.length);

    unique.forEach(row => {
      console.log(`${row.name} - ${row.email}`);
    });
  }
});

Output:

Original rows: 5
Unique rows: 3
Duplicates removed: 2
John Doe - john@example.com
Jane Smith - jane@example.com
Bob Wilson - bob@example.com

Performance comparison

Approach               Time Complexity   Memory           Best For
Set (primitives)       O(n)              O(n)             Simple value arrays
Set + JSON.stringify   O(n)              O(n)             Exact row matching
Map (objects)          O(n)              O(n)             Keep first/last by key
filter + findIndex     O(n^2)            O(1)             Small datasets only
Streaming + Set        O(n)              O(unique keys)   Large files
SHA-256 hash           O(n)              O(n)             Wide tables, many columns

The Set-based approach is 440-800x faster than filter + findIndex for arrays with 500+ items. For production use, always prefer O(n) approaches.
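To sanity-check these numbers on your own data, a quick console.time run is enough. This sketch reuses deduplicateExact from Step 2 and the quadratic baseline shown earlier; the exact speedup will vary by machine and row width:

// Hypothetical benchmark data: 5,000 rows with roughly 50% duplicates
const testRows: CsvRow[] = Array.from({ length: 5000 }, (_, i) => ({
  name: `User ${i % 2500}`,
  email: `user${i % 2500}@example.com`,
  phone: `555-${String(i % 2500).padStart(4, '0')}`,
}));

console.time('Set + JSON.stringify');
deduplicateExact(testRows);
console.timeEnd('Set + JSON.stringify');

console.time('filter + findIndex');
deduplicateQuadratic(testRows);
console.timeEnd('filter + findIndex');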

Common pitfalls

Case sensitivity

Problem: john@example.com and John@Example.com are treated as different.

// Wrong
const key = row.email;

// Correct
const key = row.email.toLowerCase();

Whitespace variations

Problem: "john@example.com" and "john@example.com " are treated as different.

// Wrong
const key = row.email.toLowerCase();

// Correct
const key = row.email.toLowerCase().trim();

Null and undefined values

Problem: Rows with missing values create invalid keys or throw errors.

// Wrong - throws if email is undefined
const key = row.email.toLowerCase();

// Correct - handles missing values
const key = (row.email ?? '').toLowerCase().trim();
// Or filter them out entirely
if (!row.email) return false;

Object reference comparison

Problem: Set compares object references, not values.

// Wrong - doesn't work for objects
const unique = [...new Set(rows)];

// Correct - create string keys
const seen = new Set<string>();
const unique = rows.filter(row => {
  const key = JSON.stringify(row);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
});

Memory overflow with large files

Problem: Loading entire file causes heap overflow.

// Wrong - loads everything into memory
const data = fs.readFileSync('huge.csv', 'utf-8');
const rows = Papa.parse(data).data;

// Correct - use streaming
fs.createReadStream('huge.csv')
  .pipe(csv())
  .on('data', (row) => { /* process incrementally */ });

The easier way: ImportCSV

Writing deduplication logic from scratch requires handling all these edge cases: normalization, streaming, column selection, and error handling. ImportCSV handles this automatically.

With ImportCSV, you get:

  • Automatic duplicate detection during import
  • Visual review of duplicates before insertion
  • Column-based deduplication rules
  • Built-in streaming for large files
  • No custom code required

import { CSVImporter } from '@importcsv/react';

function ContactImporter() {
  return (
    <CSVImporter
      onComplete={(data) => {
        // Data is already deduplicated
        console.log('Unique contacts:', data.rows.length);
      }}
    />
  );
}

Wrap-up

CSV imports shouldn't slow you down. ImportCSV is built to fit into your workflow, whether you're building data import flows, handling customer uploads, or processing large datasets.

If that sounds like the kind of tooling you want to use, try ImportCSV.