Introduction
When working with CSV files, you may encounter a file that cannot be imported directly into a database because of issues such as line endings or character set encoding. You may also be missing information required for the import, such as the file’s delimiter or the data types of its fields. The Civis API can preprocess the file to address these problems and prepare it for import.
File Properties - Standardization
The API standardizes the following file properties. The resulting file:
- Will be compressed with Gzip.
- Will not have a byte-order marker (BOM).
- Will have UNIX-style line endings (“\n”) or Windows-style line endings (“\r\n”), rather than older generation Mac-style line endings (“\r”).
- Will be UTF-8 encoded.
The API does not fix any data quality issues (e.g., quoting of special characters, differences in the number of columns across rows).
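To make the four guarantees above concrete, here is a minimal local sketch of an equivalent transformation using only the Python standard library. It is an illustration of the listed properties, not the API's actual implementation; the function name `standardize_csv_bytes` and the `source_encoding` parameter are hypothetical.

```python
import codecs
import gzip
import re


def standardize_csv_bytes(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Return gzip-compressed, BOM-free, UTF-8 bytes without lone "\\r" line endings.

    Hypothetical helper for illustration only; the Civis API does this work
    server-side and may differ in detail.
    """
    # Drop a UTF-8 byte-order marker if one is present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode(source_encoding)
    # Convert old Mac-style "\r" endings to "\n", leaving "\r\n" pairs untouched.
    text = re.sub(r"\r(?!\n)", "\n", text)
    # Re-encode as UTF-8 and compress with gzip.
    return gzip.compress(text.encode("utf-8"))
```

As with the API, this sketch leaves data quality problems such as inconsistent quoting or ragged rows exactly as they were.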
File Properties - Detection
The API detects the following file properties:
- Whether the file contains a header row. Multi-row headers are not supported.
- File field delimiter. Delimiters supported for detection are comma (“,”), tab (“\t”) and pipe (“|”).
- Data types and names (if a header row is provided) of all fields.
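The kind of detection listed above can be approximated locally with a few simple heuristics. The sketch below is an assumption-laden illustration, not the algorithm the API uses: the function names, the three type labels ("int", "float", "string"), and the header rule are all hypothetical, and quoted fields are not handled.

```python
from typing import List, Optional

# The three delimiters the article lists as supported for detection.
CANDIDATE_DELIMITERS = [",", "\t", "|"]


def guess_delimiter(lines: List[str]) -> Optional[str]:
    """Pick the candidate that splits every line into the same number (>1) of fields."""
    best, best_fields = None, 1
    for delim in CANDIDATE_DELIMITERS:
        counts = {line.count(delim) + 1 for line in lines}
        if len(counts) == 1:
            fields = counts.pop()
            if fields > best_fields:
                best, best_fields = delim, fields
    return best


def guess_type(values: List[str]) -> str:
    """Return the narrowest of "int", "float", or "string" that fits all values."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def detect(text: str) -> dict:
    """Guess delimiter, header presence, and column names/types for CSV text."""
    lines = [ln for ln in text.splitlines() if ln]
    delim = guess_delimiter(lines)
    if delim is None:
        raise ValueError("could not detect a delimiter")
    rows = [ln.split(delim) for ln in lines]
    # Heuristic: an all-text first row above at least one non-text column is a header.
    first_types = [guess_type([cell]) for cell in rows[0]]
    rest_types = [guess_type(list(col)) for col in zip(*rows[1:])]
    has_header = (all(t == "string" for t in first_types)
                  and any(t != "string" for t in rest_types))
    data_rows = rows[1:] if has_header else rows
    names = rows[0] if has_header else [f"column_{i}" for i in range(len(rows[0]))]
    types = [guess_type(list(col)) for col in zip(*data_rows)]
    return {"delimiter": delim, "has_header": has_header,
            "columns": list(zip(names, types))}
```

For example, `detect("id|name|score\n1|ann|3.5\n2|bob|4\n")` reports a pipe delimiter, a header row, and the columns `id` (int), `name` (string), and `score` (float).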
Using the API Endpoint
A Preprocess CSV job can be created by sending a `POST` request to `/files/preprocess/csv`. For additional information on available parameters, please refer to the Civis Platform API Documentation. Helper functions are available in the Civis Python and R clients.
Note that this endpoint only creates the Preprocess CSV job in Platform; it does not run the job. To run the newly created job, send a `POST` request to the `/jobs/{id}/runs` endpoint, providing the `id` of the newly created job. For more information on this endpoint, see the Civis Platform API Documentation, as well as the Python client and R client documentation. If you use the Python client, you can also import the `civis.utils` module and use the `civis.utils.run_job` helper function.
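If you are not using a client library, the two requests described above can be built by hand. The sketch below only constructs the requests with the standard library and does not send them. Several details are assumptions not stated in this article and should be checked against the Civis Platform API Documentation: the base URL, the Bearer-token `Authorization` header, and the camelCase JSON parameter names (mirroring the Python client's `file_id`, `in_place`, and `detect_table_columns`). The helper function names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL; confirm against the Civis Platform API Documentation.
API_BASE = "https://api.civisanalytics.com"


def build_post(path: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the given API path."""
    return urllib.request.Request(
        url=API_BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_preprocess_request(api_key: str, file_id: int) -> urllib.request.Request:
    """Step 1: create the Preprocess CSV job (this does not run it)."""
    # Assumed camelCase equivalents of the Python client's keyword arguments.
    body = {"fileId": file_id, "inPlace": False, "detectTableColumns": True}
    return build_post("/files/preprocess/csv", api_key, body)


def run_job_request(api_key: str, job_id: int) -> urllib.request.Request:
    """Step 2: start a run of the job created in step 1."""
    return build_post(f"/jobs/{job_id}/runs", api_key, {})
```

Each request can then be sent with `urllib.request.urlopen(request)`; the job `id` needed for the second request would be read from the JSON response to the first.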
Example
The following example uses the Civis Python client to preprocess a file and then retrieve the detected information.
import civis
import civis.utils
#1. Create the preprocess job
client = civis.APIClient()
print('Posting job...')
job_response = client.files.post_preprocess_csv(
    file_id=1234,
    in_place=False,
    detect_table_columns=True
)
print('Success! Job id is {}'.format(job_response.id))
#2. Run the preprocess job
print('Running job...')
future = civis.utils.run_job(job_response.id, client=client)
future.result()
print('Success! Job id: {0} Run id: {1}'.format(future.job_id, future.run_id))
#3. Retrieve the new file id
new_file = client.jobs.list_runs_outputs(future.job_id, future.run_id)[0]
new_file_id = new_file.object_id
print('New file id is {}'.format(new_file_id))
#4. Retrieve the detected information
detected_info = client.files.get(new_file_id).detected_info
print('Detected: \n{}'.format(detected_info))